RFC: clang-format all the things!

pogo59 · January 29, 2024, 7:35pm

Proposal

We should be pro-active about running clang-format on C/C++ sources throughout the project. In particular, it should explicitly not be a problem to commit format-only patches. The ultimate goal is to be 100% clang-format clean.

Motivation

Formatting with clang-format is in fact our policy.
A consistent style across the code base makes all our code easier to deal with.
Since the adoption of the policy, many subprojects are close to conforming already.
With the introduction of a github action to run clang-format, we can easily keep from backsliding (for any commits introduced via PRs).

Status (late January 2024)

I ran clang-format on nearly all subtrees of llvm-project, looking specifically at files with the .c, .cpp, .h, .inc, and .def suffixes. I excluded test-like directories, including test and benchmarks subdirectories as well as the cross-project-tests tree, as tests often have special formatting that we don’t want to disturb. (I did not exclude unittests.) I also excluded the imported projects that I’m aware of or tripped over, meaning the third-party directory and libc/AOR_v20.02, as we have no business clang-formatting code taken verbatim from other projects.

My results are in the following table. “Small” means the file had 1-10 lines of change; “Medium” means 11-20 lines of change. “Large” is anything > 20 lines.

Dir	Source files	Files changed	Lines changed	Small	Medium	Large
bolt	207	54	1653	30	6	18
clang	2393	1525	147145	644	237	644
clang-tools-extra	1325	241	2263	208	18	15
compiler-rt	1207	570	18806	292	98	180
flang	694	56	354	49	3	4
libc	1700	66	233842	64	1	1
libclc	375	122	1553	94	10	18
libcxx	1104	59	274	55	3	1
libcxxabi	34	30	3888	8	2	20
libunwind	20	17	1962	3	1	13
lld	214	84	763	64	14	6
lldb	2337	865	16669	665	99	101
llvm	6666	3411	212693	1784	461	1166
mlir	1778	77	331	75	1	1
openmp	228	56	257	51	4	1
polly	853	423	125400	134	56	233
pstl	34	13	631	9	1	3
utils	3	1	1	1	0	0

I eyeballed some of the diffs, and it’s clear that there’s more to this than just running clang-format on everything. (I didn’t eyeball everything, so other interesting cases might exist.)

Some .inc/.def files are not true C/C++ sources; libc especially has files with what look like special tokens for generating the “real” source. These should probably be renamed to “.in” files, or something. If we leave those aside, libc looks to be nearly 100% clean.

The only major problem with clang-format itself, that I noticed, is that it seems not to handle trailing return types very well. There appear to be some existing issues in this area. The flang and libcxx projects in particular seemed to fall victim to this.

There are of course other quirks of clang-format that I personally would find annoying, but I’m not sure they rise to the level of “bugs.”

What about non-C/C++ code?

The other major language that I know LLVM uses and clang-format supports is TableGen. In my first pass of collecting data, I eyeballed some of the .td file changes, and some of them were more peculiar than I’d like. Indentation was off, some things got mooshed together that shouldn’t have been, and so on. My guess is that TableGen support has seen a whole lot less love than C/C++ support.

So, because this is my RFC and I can make it say what I want to, I’m excluding .td files.

Objections

This section is based on my memory of other discussions that related to this topic. Feel free to mention other objections.

Churn

When we first adopted clang-format as the pretty-printer of record, the project explicitly rejects bulk-formatting the project on the grounds that it would create a huge amount of churn, and disrupt areas that people were actively working on. This decision was consistent with the same decision made the last time we updated the Coding Standards.

I think the “churn” argument is much less valid than it was at the time. Most of the smaller, newer projects are close to 100% clean already. (Although I didn’t find any that are all the way there; this surprises me, as I had thought libcxx went through this exercise not so long ago. @ldionne ?) While we can see from the table above that clang and llvm are far from close, they are also not all that terrible.

clang currently is only about 1/3 clean, in terms of file count; however, after formatting all the “small” and “medium” files, it will be about 75% clean.
llvm is currently close to 50% clean; with “small” and “medium” files done, it will be over 80% clean.

The primary downside of formatting churn is, from what I can tell, that it “pollutes” the git history. git blame becomes more tedious because you have to move past the mechanical NFC change(s) to get to the interesting bits. I hear there’s a git trick that specifies a set of commits to ignore when doing this, though, so perhaps a big-bang change (or more than one; we are talk a lot of files here, people) might not be so disruptive.

I have to say, when doing archaeology myself, I almost always have to step past multiple prior commits to get to the change I care about, so IMO this is not as big a problem as people make out. But maybe it bothers other people more than it bothers me.

Tool versioning

clang-format is, like anything else in the project, subject to change. That means its decisions might vary over time, making the specification of “clang-format clean” less precise than might be desirable. There are a couple of ways to address this.

Live with it. The CI check might not 100% agree with whatever newer version you might have run on your patch, but my guess is the differences will be rare. There is after all a way for clang-format testing to verify that it has not regressed (see clang/docs/tools/clang-formatted-files.txt and the patch that introduced it (thanks to @asb for identifying this!).
Commit a version to the repo, say once per release, and use that. This means that if something does change in a desirable way, the new version won’t be available until the next release occurs, which will certainly make somebody sad.

What’s Next?

I would like to see:

Agreement in principle that this is a good thing to do. I think shortly after a release branch has been cut is among the least terrible times to make a blanket change like this.
Decision about whether a big-bang formatting commit is appropriate, or whether we’d rather do things in a more granular way. Per-project big bangs would also be an option.
If we don’t big-bang, then agreement that code owners and other reviewers should encourage people to clang-format the files they’re working on first, before proceeding with the functional change. This is obviously much easier for people with commit access.

It would also be best, IMO, if this format-all-the-things process did not take an awfully long time; a month, or maybe two, say. In principle this has been going on for years already, and we’ve made so-so progress. Time to bite the bullet and get it done, I say!

tobiashieta · January 29, 2024, 7:42pm

I am in favour! The current state of style across the repo is annoying and now we actually have enforcement in the CI.

I would love the hear what @ldionne has to say about the experience of “big bang” formatting in libc++.

tstellar · January 29, 2024, 7:50pm

I think it’s better to do it right before a release branch is made. If you do it right after, then it makes cherry-picking fixes very difficult.

MatzeB · January 29, 2024, 7:51pm

Thanks for bringing this up. I also think it is worth re-visiting this discussion. There are certainly multiple pros and cons here and we should be careful not to ignore the discussions from the past (which I can’t find URLs for right now, I hope someone remembers… )

I remember one argument being that git blame (or rather svn blame at the time, would become trickier. Though I think we should re-visit this in the light of .git-blame-ignore-revs option in git at least reducing the bad effects a bit (though I think you have to manually set things up so the file gets used).
This will be likely be a LOT of churn of downstream users with internal patches and I suspect that to be the biggest contentious point probably… At the very least we should be prepared with some guides / help on how to deal with downstream patches.

That said, I personally think the pros outweigh the cons here and I would personally love to see this move forward.

ldionne · January 29, 2024, 7:54pm

The libc++ reformatting worked really well for us. I didn’t get any report of downstreams running into major issues, and Apple’s downstream fork handled it really well too.

However, libc++ is a smaller code base than e.g. Clang, and the amount of downstream diffs present in different areas of LLVM vs libc++ may be significantly different, so I’m not certain the libc++ experience can be used to draw too many conclusions here.

Hence, I’ll have to say I am neutral regarding this – I think having a clang-formatted code base is a great state to be in, however while I know the tradeoff for libc++ was the right one (in terms of effort, disruptiveness and benefits), I don’t claim to know what that tradeoff is like for other parts of LLVM. But for libc++ at least, the experience has been positive so far.

MatzeB · January 29, 2024, 7:56pm

My two cents on process:

Formatting whole projects (or rather llvm-project/xxx subdirs) at-once is probably a good level of granularity. Doing it in smaller increments will unnecessarily distribute the trouble over a longer time IMO.
We should enable CI enforcement in a timely manner after reformatting a subproject.
Can we do this with some subprojects with less activity first to gain some experience on how to do this, how to deal with in-flight PRs, how to setup CI testing for the clean subprojects, … Would also love to hear how this went for libc++

kbeyls · January 29, 2024, 7:59pm

Regarding tool versioning: I guess another option could also be to fix on a specific version of clang-format and only move the version after an RFC, similarly to how we discuss when we decide to move forward on versions of cmake, python, c++ version used, or other similar dependencies?

Not sure if that would be the best option, just thought I’d raise that as a plausible option too.

pogo59 · January 29, 2024, 8:00pm

As a downstream repo maintainer, that’s quite true. I personally would prefer doing the reformats in smaller chunks to make each individual conflict less painful. My primary recommendation would be to make sure the downstream patches are themselves properly formatted first.

MatzeB · January 29, 2024, 8:03pm

As a downstream repo maintainer, that’s quite true. I personally would prefer doing the reformats in smaller chunks to make each individual conflict less painful. My primary recommendation would be to make sure the downstream patches are themselves properly formatted first.

We also have a bunch of downstream patches here. Though it feels to me like this would mostly need one person to figure out the right processed/tools of reformatting a patch and then applying that to all downstream patches. And I would expect that to be most efficient if you figure it out once and then apply to all patches, rather than doing it again and again as things change piecemeal. (Though not a strong opinion either way)

philnik · January 29, 2024, 8:12pm

I think this would be really bad for libc++. We’re using bleeding edge features all the time, which makes us very reliant on up-to-date tooling. We’d have to write an RFC basically every release to get better formatting for new features.

pogo59 · January 29, 2024, 8:13pm

Hmmm… I was thinking it would hard to generalize, but actually: if the patch is strictly a reformat, then the merge conflict resolution could be as simple as (a) take your downstream version of the file, (b) clang-format it.

slinder1 · January 29, 2024, 8:25pm

I agree wholeheartedly that we should implement this!

The blame.ignoreRevsFile mechanism is enabled by default on GitHub for a file called .git-blame-ignore-revs, see Viewing a file - GitHub Docs. I don’t know that git-blame documents a canonical filename anywhere, so this seems like the best choice?

The only issue is that as @MatzeB mentions it is up to the user to enable it in their local git client. I think this is OK, since as @pogo59 mentions not every user will find the extra commits annoying, and we can include it as an optional step in our docs so those that do are aware of it:

# OPTIONAL: ignore select NFC (No Functional Change) commits in git-blame
git config blame.ignoreRevsFile .git-blame-ignore-revs

As for the downstream issue, it seem like we should be able to automate the process:

We can ensure we only reformat whole-files, and note these NFC commits in a “format-log” somewhere, including clang-format version info
Downstream users/CI can detect new changes in the format-log and repeat a process for every new entry:
- First get the same clang-format version (hand waving the issue of how for now) and confirm it is a no-op when run over the upstream format-commit
- Then do their usual merge/rebase based on the parent of the format-commit
- Then run clang-format over the files that were modified in the format-commit
- Finally do the merge/rebase for the format-commit proper

We can script this for various workflows to ease the process, and make it so even future automated fixups can be handled seamlessly.

pogo59 · January 29, 2024, 8:29pm

Here’s a question: What version of clang-format does the pre-CI job use? If it’s not basically “the most recent” then how is it updated?

pogo59 · January 29, 2024, 8:35pm

Good point. I think it would worth waiting at least until the pace of fixes to the release branch has slowed down. I don’t know if I’d want to wait all the way until June.

boomanaiden154-1 · January 29, 2024, 8:36pm

It just uses the latest release version (or at least did the last time someone touched the clang-format version). It’s at 17.0.1 currently.

github.com

llvm/llvm-project/blob/e633807a1fccbed91dbfe1fdc2c78adcaf21d99c/.github/workflows/pr-code-format.yml#L52


      
          - name: "Listed files"
            env:
              CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
            run: |
              echo "Formatting files:"
              echo "$CHANGED_FILES"
          
          - name: Install clang-format
            uses: aminya/setup-cpp@v1
            with:
              clangformat: 17.0.1
          
          - name: Setup Python env
            uses: actions/setup-python@v4
            with:
              python-version: '3.11'
              cache: 'pip'
              cache-dependency-path: 'code-format-tools/llvm/utils/git/requirements_formatting.txt'
          
          - name: Install python dependencies
            run: pip install -r code-format-tools/llvm/utils/git/requirements_formatting.txt

slinder1 · January 29, 2024, 8:39pm

If we assume that getting the clang-format binary when given a version is a solved problem then we could just maintain a file committed to the repo with said version:

Upstream CI reads the version from one file, does the format and gates reviews on it.
A maintenance job that cleans up after mistakes and handles changes to the version file appends to the log, including the version.
Downstream CI reads the version from the log.

Then, as the version we use changes everything continues to work and there is an automated path for downstream users.

Maybe the difficulty is in the “automatically get/build an executable copy of clang-format for an arbitrary CI host given a version string” part?

RKSimon · January 29, 2024, 11:00pm

Regarding some of the potential churn from this - do we have any styleguide rules regarding the use of // clang-format off + // clang-format on tags in the source code? And should we consider using it in more places?

MatzeB · January 29, 2024, 11:31pm

do we have any styleguide rules regarding the use of // clang-format off + // clang-format on tags in the source code?

I think it’s fine to use it, so not really necessary to document that fact.

But I assume you have a list of situations in mind where it would be recommended to turn formatting off?

MatzeB · January 29, 2024, 11:35pm

Usually the problem of “just use the latest version” is that new versions can change the formatting so in order to stay 100% clean you would somehow need to reformat every time the clang-format version changes. I don’t know if that can or should be automated, but personally I feel that documenting/setting a fixed version and manually upgrading from time to time may be the safer choice after mass-reformatting…

boomanaiden154-1 · January 30, 2024, 4:56am

Right. These are good points. We don’t have any plan on what exactly to do with the code formatting action. It was right around the beginning of the 17 release cycle, originally running clang-format 16.0.6, and then being bumped to 17.0.1 (which was to fix some bug with code indentation), so there isn’t a lot of experience yet on how the version bumps impact the CI job for this specifically.

We will almost certainly end up using whatever version/upgrade plan that comes out of this RFC.

Topic		Replies	Views
[RFC] Clang-formatting all of libc++ once and for all C++ libcxx	32	1993	January 1, 2024
[RFC] Semi-Automatic clang-format of files with low frequency LLVM Project	1	380	June 27, 2020
RFC: clang-format github action LLVM Project rfc , github , clang-format	33	1030	September 19, 2023
Keeping LLVM Clang-Format Clean Project Infrastructure	4	324	January 27, 2022
[RFC] Semi-Automatic clang-format of files with low frequency LLVM Dev List Archives	27	120	July 1, 2020