RFC: clang-format all the things!

Proposal

We should be pro-active about running clang-format on C/C++ sources throughout the project. In particular, it should explicitly not be a problem to commit format-only patches. The ultimate goal is to be 100% clang-format clean.

Motivation

  • Formatting with clang-format is in fact our policy.
  • A consistent style across the code base makes all our code easier to deal with.
  • Since the adoption of the policy, many subprojects are close to conforming already.
  • With the introduction of a github action to run clang-format, we can easily keep from backsliding (for any commits introduced via PRs).

Status (late January 2024)

I ran clang-format on nearly all subtrees of llvm-project, looking specifically at files with the .c, .cpp, .h, .inc, and .def suffixes. I excluded test-like directories, including test and benchmarks subdirectories as well as the cross-project-tests tree, as tests often have special formatting that we don’t want to disturb. (I did not exclude unittests.) I also excluded the imported projects that I’m aware of or tripped over, meaning the third-party directory and libc/AOR_v20.02, as we have no business clang-formatting code taken verbatim from other projects.

My results are in the following table. “Small” means the file had 1-10 lines of change; “Medium” means 11-20 lines of change. “Large” is anything > 20 lines.

Dir Source files Files changed Lines changed Small Medium Large
bolt 207 54 1653 30 6 18
clang 2393 1525 147145 644 237 644
clang-tools-extra 1325 241 2263 208 18 15
compiler-rt 1207 570 18806 292 98 180
flang 694 56 354 49 3 4
libc 1700 66 233842 64 1 1
libclc 375 122 1553 94 10 18
libcxx 1104 59 274 55 3 1
libcxxabi 34 30 3888 8 2 20
libunwind 20 17 1962 3 1 13
lld 214 84 763 64 14 6
lldb 2337 865 16669 665 99 101
llvm 6666 3411 212693 1784 461 1166
mlir 1778 77 331 75 1 1
openmp 228 56 257 51 4 1
polly 853 423 125400 134 56 233
pstl 34 13 631 9 1 3
utils 3 1 1 1 0 0

I eyeballed some of the diffs, and it’s clear that there’s more to this than just running clang-format on everything. (I didn’t eyeball everything, so other interesting cases might exist.)

Some .inc/.def files are not true C/C++ sources; libc especially has files with what look like special tokens for generating the “real” source. These should probably be renamed to “.in” files, or something. If we leave those aside, libc looks to be nearly 100% clean.

The only major problem with clang-format itself, that I noticed, is that it seems not to handle trailing return types very well. There appear to be some existing issues in this area. The flang and libcxx projects in particular seemed to fall victim to this.

There are of course other quirks of clang-format that I personally would find annoying, but I’m not sure they rise to the level of “bugs.”

What about non-C/C++ code?

The other major language that I know LLVM uses and clang-format supports is TableGen. In my first pass of collecting data, I eyeballed some of the .td file changes, and some of them were more peculiar than I’d like. Indentation was off, some things got mooshed together that shouldn’t have been, and so on. My guess is that TableGen support has seen a whole lot less love than C/C++ support.

So, because this is my RFC and I can make it say what I want to, I’m excluding .td files.

Objections

This section is based on my memory of other discussions that related to this topic. Feel free to mention other objections.

Churn

When we first adopted clang-format as the pretty-printer of record, the project explicitly rejects bulk-formatting the project on the grounds that it would create a huge amount of churn, and disrupt areas that people were actively working on. This decision was consistent with the same decision made the last time we updated the Coding Standards.

I think the “churn” argument is much less valid than it was at the time. Most of the smaller, newer projects are close to 100% clean already. (Although I didn’t find any that are all the way there; this surprises me, as I had thought libcxx went through this exercise not so long ago. @ldionne ?) While we can see from the table above that clang and llvm are far from close, they are also not all that terrible.

  • clang currently is only about 1/3 clean, in terms of file count; however, after formatting all the “small” and “medium” files, it will be about 75% clean.
  • llvm is currently close to 50% clean; with “small” and “medium” files done, it will be over 80% clean.

The primary downside of formatting churn is, from what I can tell, that it “pollutes” the git history. git blame becomes more tedious because you have to move past the mechanical NFC change(s) to get to the interesting bits. I hear there’s a git trick that specifies a set of commits to ignore when doing this, though, so perhaps a big-bang change (or more than one; we are talk a lot of files here, people) might not be so disruptive.

I have to say, when doing archaeology myself, I almost always have to step past multiple prior commits to get to the change I care about, so IMO this is not as big a problem as people make out. But maybe it bothers other people more than it bothers me.

Tool versioning

clang-format is, like anything else in the project, subject to change. That means its decisions might vary over time, making the specification of “clang-format clean” less precise than might be desirable. There are a couple of ways to address this.

  1. Live with it. The CI check might not 100% agree with whatever newer version you might have run on your patch, but my guess is the differences will be rare. There is after all a way for clang-format testing to verify that it has not regressed (see clang/docs/tools/clang-formatted-files.txt and the patch that introduced it (thanks to @asb for identifying this!).

  2. Commit a version to the repo, say once per release, and use that. This means that if something does change in a desirable way, the new version won’t be available until the next release occurs, which will certainly make somebody sad.

What’s Next?

I would like to see:

  1. Agreement in principle that this is a good thing to do. I think shortly after a release branch has been cut is among the least terrible times to make a blanket change like this.

  2. Decision about whether a big-bang formatting commit is appropriate, or whether we’d rather do things in a more granular way. Per-project big bangs would also be an option.

  3. If we don’t big-bang, then agreement that code owners and other reviewers should encourage people to clang-format the files they’re working on first, before proceeding with the functional change. This is obviously much easier for people with commit access.

It would also be best, IMO, if this format-all-the-things process did not take an awfully long time; a month, or maybe two, say. In principle this has been going on for years already, and we’ve made so-so progress. Time to bite the bullet and get it done, I say!

21 Likes

I am in favour! The current state of style across the repo is annoying and now we actually have enforcement in the CI.

I would love the hear what @ldionne has to say about the experience of “big bang” formatting in libc++.

I think it’s better to do it right before a release branch is made. If you do it right after, then it makes cherry-picking fixes very difficult.

6 Likes

Thanks for bringing this up. I also think it is worth re-visiting this discussion. There are certainly multiple pros and cons here and we should be careful not to ignore the discussions from the past (which I can’t find URLs for right now, I hope someone remembers… )

  • I remember one argument being that git blame (or rather svn blame at the time, would become trickier. Though I think we should re-visit this in the light of .git-blame-ignore-revs option in git at least reducing the bad effects a bit (though I think you have to manually set things up so the file gets used).
  • This will be likely be a LOT of churn of downstream users with internal patches and I suspect that to be the biggest contentious point probably… At the very least we should be prepared with some guides / help on how to deal with downstream patches.

That said, I personally think the pros outweigh the cons here and I would personally love to see this move forward.

2 Likes

The libc++ reformatting worked really well for us. I didn’t get any report of downstreams running into major issues, and Apple’s downstream fork handled it really well too.

However, libc++ is a smaller code base than e.g. Clang, and the amount of downstream diffs present in different areas of LLVM vs libc++ may be significantly different, so I’m not certain the libc++ experience can be used to draw too many conclusions here.

Hence, I’ll have to say I am neutral regarding this – I think having a clang-formatted code base is a great state to be in, however while I know the tradeoff for libc++ was the right one (in terms of effort, disruptiveness and benefits), I don’t claim to know what that tradeoff is like for other parts of LLVM. But for libc++ at least, the experience has been positive so far.

2 Likes

My two cents on process:

  • Formatting whole projects (or rather llvm-project/xxx subdirs) at-once is probably a good level of granularity. Doing it in smaller increments will unnecessarily distribute the trouble over a longer time IMO.
  • We should enable CI enforcement in a timely manner after reformatting a subproject.
  • Can we do this with some subprojects with less activity first to gain some experience on how to do this, how to deal with in-flight PRs, how to setup CI testing for the clean subprojects, … Would also love to hear how this went for libc++
2 Likes

Regarding tool versioning: I guess another option could also be to fix on a specific version of clang-format and only move the version after an RFC, similarly to how we discuss when we decide to move forward on versions of cmake, python, c++ version used, or other similar dependencies?

Not sure if that would be the best option, just thought I’d raise that as a plausible option too.

2 Likes

As a downstream repo maintainer, that’s quite true. I personally would prefer doing the reformats in smaller chunks to make each individual conflict less painful. My primary recommendation would be to make sure the downstream patches are themselves properly formatted first.

1 Like

As a downstream repo maintainer, that’s quite true. I personally would prefer doing the reformats in smaller chunks to make each individual conflict less painful. My primary recommendation would be to make sure the downstream patches are themselves properly formatted first.

We also have a bunch of downstream patches here. Though it feels to me like this would mostly need one person to figure out the right processed/tools of reformatting a patch and then applying that to all downstream patches. And I would expect that to be most efficient if you figure it out once and then apply to all patches, rather than doing it again and again as things change piecemeal. (Though not a strong opinion either way)

1 Like

I think this would be really bad for libc++. We’re using bleeding edge features all the time, which makes us very reliant on up-to-date tooling. We’d have to write an RFC basically every release to get better formatting for new features.

2 Likes

Hmmm… I was thinking it would hard to generalize, but actually: if the patch is strictly a reformat, then the merge conflict resolution could be as simple as (a) take your downstream version of the file, (b) clang-format it.

2 Likes

I agree wholeheartedly that we should implement this!

The blame.ignoreRevsFile mechanism is enabled by default on GitHub for a file called .git-blame-ignore-revs, see Viewing a file - GitHub Docs. I don’t know that git-blame documents a canonical filename anywhere, so this seems like the best choice?

The only issue is that as @MatzeB mentions it is up to the user to enable it in their local git client. I think this is OK, since as @pogo59 mentions not every user will find the extra commits annoying, and we can include it as an optional step in our docs so those that do are aware of it:

# OPTIONAL: ignore select NFC (No Functional Change) commits in git-blame
git config blame.ignoreRevsFile .git-blame-ignore-revs

As for the downstream issue, it seem like we should be able to automate the process:

  • We can ensure we only reformat whole-files, and note these NFC commits in a “format-log” somewhere, including clang-format version info
  • Downstream users/CI can detect new changes in the format-log and repeat a process for every new entry:
    • First get the same clang-format version (hand waving the issue of how for now) and confirm it is a no-op when run over the upstream format-commit
    • Then do their usual merge/rebase based on the parent of the format-commit
    • Then run clang-format over the files that were modified in the format-commit
    • Finally do the merge/rebase for the format-commit proper

We can script this for various workflows to ease the process, and make it so even future automated fixups can be handled seamlessly.

4 Likes

Here’s a question: What version of clang-format does the pre-CI job use? If it’s not basically “the most recent” then how is it updated?

Good point. I think it would worth waiting at least until the pace of fixes to the release branch has slowed down. I don’t know if I’d want to wait all the way until June.

It just uses the latest release version (or at least did the last time someone touched the clang-format version). It’s at 17.0.1 currently.

If we assume that getting the clang-format binary when given a version is a solved problem then we could just maintain a file committed to the repo with said version:

  • Upstream CI reads the version from one file, does the format and gates reviews on it.
  • A maintenance job that cleans up after mistakes and handles changes to the version file appends to the log, including the version.
  • Downstream CI reads the version from the log.

Then, as the version we use changes everything continues to work and there is an automated path for downstream users.

Maybe the difficulty is in the “automatically get/build an executable copy of clang-format for an arbitrary CI host given a version string” part?

Regarding some of the potential churn from this - do we have any styleguide rules regarding the use of // clang-format off + // clang-format on tags in the source code? And should we consider using it in more places?

1 Like

do we have any styleguide rules regarding the use of // clang-format off + // clang-format on tags in the source code?

I think it’s fine to use it, so not really necessary to document that fact.

But I assume you have a list of situations in mind where it would be recommended to turn formatting off?

Usually the problem of “just use the latest version” is that new versions can change the formatting so in order to stay 100% clean you would somehow need to reformat every time the clang-format version changes. I don’t know if that can or should be automated, but personally I feel that documenting/setting a fixed version and manually upgrading from time to time may be the safer choice after mass-reformatting…

4 Likes

Right. These are good points. We don’t have any plan on what exactly to do with the code formatting action. It was right around the beginning of the 17 release cycle, originally running clang-format 16.0.6, and then being bumped to 17.0.1 (which was to fix some bug with code indentation), so there isn’t a lot of experience yet on how the version bumps impact the CI job for this specifically.

We will almost certainly end up using whatever version/upgrade plan that comes out of this RFC.