Our AI policy vs code of conduct and vs reality

We have a code of conduct encouraging various behaviours, especially being welcoming and patient with newcomers.

We also have a statement that AI contributions are acceptable provided the contributor understands the code before submitting for review.

These interact in an unfortunate fashion. I’m posting this after seeing the Hacker News discussion of Gentoo’s AI policy (Gentoo AI Policy | Hacker News), specifically these comments from a newcomer to LLVM:

I’ve been using AI to contribute to LLVM, which has a liberal policy.

The code is of terrible quality and I am at 100+ comments on my latest PR.

The PR in question is [clang-tidy] Add portability-avoid-platform-specific-fundamental-types by jj-marr · Pull Request #146970 · llvm/llvm-project · GitHub, and it has indeed attracted a lot of reviewer time. I searched the commit message and comments for some indication that this was written with AI, but there is none, and indeed none is required.

I believe the path we are currently treading has some very predictable negative outcomes. We should consider changing track, either to reject newcomers or to reject AI contributions. If we embrace both, our reviewers are going to burn out in rather short order. My preference would be to reject the AI contributions, but strictly speaking I suppose either works.

The arguments for and against AI-generated code are rather well hashed out already - I specifically want to call attention to the interaction with our code of conduct, which predates the new wave of generated PRs. The combination of the two looks like an existential threat to the compiler project.

14 Likes

The inadequacy of our current AI policy has featured prominently in the last two LLVM project council meetings (cc @project-council), and @rnk has been working on revising the policy here: [docs] Strengthen our quality standards and connect AI contribution policy to it by rnk · Pull Request #154441 · llvm/llvm-project · GitHub (Though I believe this hasn’t been shared as an RFC yet.)

5 Likes

After an interaction with someone on Discord, I added a comment explaining AI usage a few months ago.

I also just edited the PR’s original body to disclose AI usage since that comment wasn’t clearly visible.

I’d say the main reason the code is poor quality is that I wrote it less than 3 months after graduating, with almost no experience developing compilers.

1 Like
  1. Isn’t that a good reason to write it yourself, to get more familiar with the code base?
  2. How does being a recent graduate even matter, given that the AI wrote the code and not you? Unless you studied advanced prompt science, that doesn’t seem like it should matter.
2 Likes

I think this illustrates the situation perfectly. We should welcome new contributors, but that inevitably means their contributions will not always be up to the standard of experienced contributors (this isn’t a dig at @jjmarr, just a statement of fact). We also shouldn’t reject AI contributions simply because they are AI contributions. Any policy around AI should focus on things like copyright issues, with everything else pointing new contributors towards making sure their code conforms to the standards we already document. If a developer puts up a poor-quality PR, because it doesn’t have a test case, uses the wrong naming convention, etc., they should be pointed at the docs and asked to review and fix it accordingly before reviewers sink further effort into it. We can and should expand our docs as needed to cover anything that isn’t clear.

Regarding AI, my thought is that we should accept that AI contributions are going to be the norm in the future and adapt to that. In particular, this means the balance of reviewers to developers needs to shift in favour of more reviewers and more time spent reviewing, because the relative effort of making contributions versus reviewing them is going to shift over time. Putting it another way, imagine the same code-writing productivity increase came from some other, non-AI tool, e.g. a better IDE. Would we ban use of that IDE simply because it allows developers to write code faster?

5 Likes

This is just a bad experience for everyone. Contributing to LLVM is going to have a higher bar than most projects, and we need to accept that and adjust our policies accordingly. Repeats of this type of incident will discourage new contributors and scare away code reviewers who don’t want to deal with low-quality submissions. We are already very short-handed with respect to code reviews.

Curl banned LLM bug reports because they were effectively being DDoSed by them (https://mastodon.social/@LukaszOlejnik/114454277500230445), and NetBSD considers LLM-generated code tainted: NetBSD Commit Guidelines

I think we don’t have to ban it but the bar is going to have to be significantly higher.

5 Likes

I’m vastly in favor of changing our AI policy to just disallow it. CONTROLLING that is going to be an honor system, but it needs to be done anyway; it is harmful to everyone.

I am one of the largest-by-most-metrics reviewers on Clang. We’ve seen a bunch of these reviews, and frankly, it has resulted in us being less welcoming to newbies. Historically, if I got a review where the person didn’t have a sufficient understanding of what they were doing, I’d be able to ‘hand hold’ a reasonable amount. I typically did this since:

1- It was ‘less’ often
2- the people ‘learned’ quickly and thus could participate/get better over time.
3- The mistakes were likely results of copy/paste, and it often identified issues elsewhere.
4- The individuals were very respectful/receptive to the changes, and asked reasonable/productive followup/pushback questions.

HOWEVER, the ones that I suspect are AI contributions fail at all of these:

1- We are getting these more often. This makes my workload (for something I don’t get paid to do!) that much greater. This means SOMETHING has to give, and usually, it is ‘reviews that are furthest from completion’, typically new contributors/AI contributions.
2- In my experience, folks using AI do a much worse job at understanding their patch, and thus a much worse job at getting better over time. The amount of effort required to get these contributors to the point of making better contributions is so much greater that it is no longer worth it.
3- We get no such benefit.
4- Thanks in part to 2 and, worse, to folks using AI to generate responses to me, they are much worse at asking productive follow-up/pushback questions.

As a result, the first group (just new contributors using their brains) are getting thrown in with the latter group (those who used AI), and are getting ignored or given MUCH worse reviews, and thus don’t get the benefits of review. (A bit of ‘throwing out the baby with the bathwater’, unfortunately.)

As the original post said, we have to do one of two things:
1- Stop accepting AI contributions like this
2- Accept that new contributors are just not going to get review bandwidth, and likely won’t get their contributions accepted.

IMO, AI has shown very little value in the FE (frontend), and new contributors writing their own code has shown LARGE value there. So I think it is pretty clear which side we should go with.

13 Likes

I wrote the first draft with AI, then heavily edited it before submitting. I pushed back on a lot of issues; e.g., the check originally flagged uint32_t because its underlying type is unsigned int. The AI also kept deleting test cases I had added, to try to make them pass. I invested significant time planning and architecting what I wanted before writing anything.

I also made the PR because AI keeps trying to use int in my codebase when my own coding guidelines suggest size_t or uint32_t. I’m aware that AI-generated code has common issues, and this PR is intended to reduce that review burden.
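
For context, this is roughly the kind of pattern the check is aimed at (an illustrative sketch I’m adding here, not code from the PR; the actual check’s diagnostic text and options may differ):

```cpp
// Illustrative sketch only; not taken from the PR, and the real check's
// diagnostics and fix-its may differ.
#include <cstddef>
#include <cstdint>

// Likely flagged: 'long' (like 'int', 'short', ...) has a platform-dependent width.
long bytesProcessed = 0;

// Preferred: a fixed-width alias states the intended width explicitly.
std::int64_t bytesProcessedFixed = 0;

// Should not be flagged (the uint32_t issue mentioned above): the alias is the
// portable spelling, even though its underlying type is a fundamental type
// such as 'unsigned int'.
std::uint32_t packetCount = 0;

// 'size_t' remains the conventional choice for sizes and indices.
std::size_t bufferLength = 1024;
```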

It matters to the extent that I’m unfamiliar with contributing to open source and communicating professionally. I’m also not as good at reviewing as I otherwise would be.

I’m not writing a PR with AI and chucking it over the fence. I spent days reviewing and planning the code before I even uploaded something to GitHub.

Maybe this would be clearer if I uploaded the original draft to GitHub and did my own review in the interface before seeking review from other LLVM community members. I’ve done this in the past and will do so in the future.

That’s good to hear. I feel like your PR isn’t a good candidate to highlight the bad sides of AI code generation; we’ve seen far worse in LLVM’s GitHub.

4 Likes

This is why I think it’s a really bad idea to do something like outright ban AI usage. It seems clear to me that @jjmarr has put quite a bit of effort in and isn’t just forwarding AI slop to us. I acknowledge that this certainly isn’t always the case, but do we really want to push away new contributors just because they started with AI-suggested code? Banning it would very much be trying to push back the tide: the majority of new developers are going to come from an AI starting point and use it to help them, whether we like it or not. We have to learn to adapt, not ban it simply because we’re unwilling to adapt.

Anyway, what is the line for something to be considered AI-generated? Is brace auto-completion banned in this scenario? What about Visual Studio’s IntelliCode, which suggests a few lines to complete common patterns? What about experienced developers who are leveraging AI to shorten routine tasks?

As someone who almost exclusively reviews PRs rather than creating them (in my case, I’m involved in most reviews of the LLVM binutils), I am well aware that we do not have enough reviewers, which inevitably means we are going to have to be selective about which PRs we review. The judgement call should be on the initial quality and usefulness of the PR, not specifically on whether it contains AI-generated code.

As much as I don’t like the whole AI trend, it is a way to learn and to optimize routine tasks. A full ban due to quality concerns may not be the right thing. A new contributor completely unfamiliar with the project and without a lot of industry experience (say, an intern) will also likely not produce a high-quality PR. Do we ban interns too, so they never get any experience, due to quality concerns?

2- In my experience, folks using AI do a much worse job at understanding their patch, and thus a much worse job at getting better over time. The amount of effort required to get these contributors to the point of making better contributions is so much greater that it is no longer worth it.

4- Thanks in part to 2 and, worse, to folks using AI to generate responses to me, they are much worse at asking productive follow-up/pushback questions.

I think such a thing already violates the current developer policy. According to the LLVM Developer Policy — LLVM 22.0.0git documentation:

contributors are considered responsible for their contributions. We encourage contributors to review all generated code before sending it for review to verify its correctness and to understand it so that they can answer questions during code review. Reviewing and maintaining generated code that the original contributor does not understand is not a good use of limited project resources.

It may make sense to explicitly state that answering review questions with an AI tool is not allowed. The patch is the responsibility of the author, not of the AI tool.

Anyway, what is the line for something to be considered AI-generated? Is brace auto-completion banned in this scenario? What about Visual Studio’s IntelliCode, which suggests a few lines to complete common patterns? What about experienced developers who are leveraging AI to shorten routine tasks?

+1 to that.

I think the problem we have is people ignoring the current policy, not the AI usage itself.

1 Like

I think the best policy is to require disclosure when using AI tools. I don’t really have a problem with an experienced contributor using AI to help write the patch, because they are able to actually review the code and make sure it’s good before submitting.

For me the problem is people who are new to the project and don’t actually have the experience to follow our existing policy which says:

We encourage contributors to review all generated code before sending it for review to verify its correctness and to understand it so that they can answer questions during code review.

If we require disclosure for AI contributions, then reviewers can make their own decisions about how much effort to put into the review. Personally, I’m much more willing to put in the effort to help a new contributor who wrote something themselves, because then they can learn and grow into becoming a solid contributor. If a patch is generated with AI and then all my review comments are just fed back into an AI model, then no one is learning anything, and I’m better off just using AI to generate the patch myself.

6 Likes

I’m unsure how well disclosure works in practice. If the generated code is indistinguishable from human-written code, then there’s no real utility in disclosing it. If the code is poor quality and it’s disclosed, reviewers will just disregard the PR, which is then a direct incentive to not disclose it.

2 Likes

which is then a direct incentive to not disclose it.

Is the implication that many/every contributor using AI will consistently lie to LLVM reviewers about it?

I don’t think that’s likely. I think it is more likely that in general people will disclose it if they’re asked to.

Even if I’m wrong, the cost of such a request when universally ignored seems to approach zero, so I don’t see the argument for not making the request.

When the disclosure is present, the utility is that reviewers:

  • Can self-select for code they are comfortable/competent in reviewing. I personally don’t use any LLM tools and I’d be uncomfortable reviewing a medium-to-large patch if it were primarily generated with LLMs. Some LLM enthusiast may love reviewing such a patch. I think giving reviewers agency is the right call.
  • Can do a better job while reviewing (e.g. they can look for problems more common in LLM-generated code when they know they are reviewing LLM-generated code).
  • Can more meaningfully mentor new contributors. One key aspect of reviewing is meeting contributors where they are, without going over their heads or inadvertently patronizing them. With the context that, e.g., the bulk of the code has no human author, it becomes easier to just dictate things in the review rather than try to open up a dialogue with said non-existent author.

I also think the more pernicious issue concerns the reports of contributors using LLMs to generate responses. I think we should absolutely ask someone to disclose if they are just facilitating a conversation between LLVM reviewers and an LLM, as opposed to actually engaging in the review process.

In any case, if people tell bald-faced lies, then it would be convenient to have a formal policy enabling us to just stop engaging with them.

Hey folks, sorry I haven’t found time to drive the AI policy update forward. I just pushed a local change to the PR branch that pulls the policy out of the main developer policy doc and into its own doc, which uses Markdown, not Sphinx. GitHub renders that, so here’s a direct link to the rendered draft. I think further edits are required and I don’t think I’ve addressed all outstanding concerns, but perhaps shipping it would be better than continuing to iterate.

Fundamentally, I think we want to have conversations about patch quality, not tools. To that end, the policy leans on the notion of “extractive contributions”: contributions whose review takes resources from the project rather than sustaining it. As Shafik points out, LLVM is pretty low-level. There is a high quality bar for contribution, and it’s often really difficult to understand how all these systems fit together. I think maintainers absolutely have a right to be defensive of their time, and to quickly decline to review PRs that don’t pass cost-benefit, regardless of the tools used to produce them.

For example, typo corrections in test cases are a classic extractive contribution: the kind of automated, drive-by change that someone might use to pad their GitHub stats, where the contributor gets more out of it than the project does.

I’ve seen other reports of people using AI tools to attempt to automate the entire PR review process, and these are also clearly not a good use of project time and resources.

The point is, I think we should strive for a policy that produces the right answer for good tools and for bad tools. The cost is that a less clear-cut policy can be more difficult to apply in practice. I think the process details, like how we label and hide extractive contributions, are going to matter a lot if we want to protect our collective attention. I may be too idealistic, but that’s what I’d like to try to do as a project.


This topic is semi-related and deserves its own thread, but with respect to PR review and onboarding new contributors: I went to OSSNA and attended the CHAOSS event (Community Health Analytics for Open Source Software), and learned a bunch about how other projects handle community health and onboarding. Some CHAOSS folks, including Ildikó Vancsa, have spoken at our events in the past and shared some of their wisdom.

It turns out that data analysis and metrics can also be applied to open source community management. We can do things like track first-time contributor PR response time, acceptance rate, and so on. These seem like good things to track! I don’t want to overcommit, but I’ve been exploring ways we can get those insights from GrimoireLab, 8knot, LFX Insights, and others. And then, as with any good data-driven project, you make a change, like creating a shared GitHub PR search that all maintainers work from, to drive the metrics in the right direction. This is all corporate project management 101 stuff, but I don’t think it comes naturally to distributed, self-directed OSS contributor types.

4 Likes

Our Developer Policy already has requirements regarding code quality and incremental development. Why is this separate document necessary? I would like to avoid duplicating content in multiple places. Perhaps you should reference the Developer Policy and state that even AI-assisted PR creation needs to follow the same rules, instead of duplicating things here.

1 Like

I think being explicit here is critical. We are far from the only project experiencing some of this pain, and more and more projects are working towards becoming more explicit; the latest example: Faith Ekstrand: "Mesa is working to update our contributor guide. …" - Treehouse Mastodon

Previous tools did not have the capability of generating the extraordinarily verbose and seemingly “correct-shaped” correspondence and code that we are now seeing with LLMs. So explicitly calling them out in the policy feels necessary, but I am willing to be convinced otherwise.

5 Likes

That was my initial approach, but the PR review ended up fixating on wordsmithing refactorings of the existing policy, which I didn’t think was helpful. When it came time to draft the proposed text for closing an extractive PR, I realized I wanted a single URL to a single policy doc explaining the purpose, and then I decided a separate doc would be better. But, yes, cross-referencing and updating existing docs are important.

Perhaps you can find a way to remove specific details that are duplicated. For example, the requirement that patches adhere to the coding standards, and all the other criteria related to patch quality, is already covered here:

https://llvm.org/docs/DeveloperPolicy.html#quality

Maybe you can rephrase it so that these are presented as patterns we see in AI-assisted PRs/code and things to watch out for, instead.

2 Likes

I only skimmed the current proposal here, but my first impression is that we’re being way too verbose, and something much simpler should do. Here’s a rough attempt:

As a project, we allow contributors to use AI tools; however, the use of such tools does not change our existing quality (link it) or other expectations for contributors. The presumption is that code generated by an LLM does not meet our quality bar, and the responsibility for refuting that presumption lies with the patch author for each individual contribution.

4 Likes