Proposal
We should be pro-active about running clang-format on C/C++ sources throughout the project. In particular, it should explicitly not be a problem to commit format-only patches. The ultimate goal is to be 100% clang-format clean.
Motivation
- Formatting with clang-format is in fact our policy.
- A consistent style across the code base makes all our code easier to deal with.
- Since the adoption of the policy, many subprojects are close to conforming already.
- With the introduction of a github action to run clang-format, we can easily keep from backsliding (for any commits introduced via PRs).
Status (late January 2024)
I ran clang-format on nearly all subtrees of llvm-project, looking specifically at files with the .c
, .cpp
, .h
, .inc
, and .def
suffixes. I excluded test-like directories, including test
and benchmarks
subdirectories as well as the cross-project-tests
tree, as tests often have special formatting that we don’t want to disturb. (I did not exclude unittests
.) I also excluded the imported projects that I’m aware of or tripped over, meaning the third-party
directory and libc/AOR_v20.02
, as we have no business clang-formatting code taken verbatim from other projects.
My results are in the following table. “Small” means the file had 1-10 lines of change; “Medium” means 11-20 lines of change. “Large” is anything > 20 lines.
Dir | Source files | Files changed | Lines changed | Small | Medium | Large |
---|---|---|---|---|---|---|
bolt | 207 | 54 | 1653 | 30 | 6 | 18 |
clang | 2393 | 1525 | 147145 | 644 | 237 | 644 |
clang-tools-extra | 1325 | 241 | 2263 | 208 | 18 | 15 |
compiler-rt | 1207 | 570 | 18806 | 292 | 98 | 180 |
flang | 694 | 56 | 354 | 49 | 3 | 4 |
libc | 1700 | 66 | 233842 | 64 | 1 | 1 |
libclc | 375 | 122 | 1553 | 94 | 10 | 18 |
libcxx | 1104 | 59 | 274 | 55 | 3 | 1 |
libcxxabi | 34 | 30 | 3888 | 8 | 2 | 20 |
libunwind | 20 | 17 | 1962 | 3 | 1 | 13 |
lld | 214 | 84 | 763 | 64 | 14 | 6 |
lldb | 2337 | 865 | 16669 | 665 | 99 | 101 |
llvm | 6666 | 3411 | 212693 | 1784 | 461 | 1166 |
mlir | 1778 | 77 | 331 | 75 | 1 | 1 |
openmp | 228 | 56 | 257 | 51 | 4 | 1 |
polly | 853 | 423 | 125400 | 134 | 56 | 233 |
pstl | 34 | 13 | 631 | 9 | 1 | 3 |
utils | 3 | 1 | 1 | 1 | 0 | 0 |
I eyeballed some of the diffs, and it’s clear that there’s more to this than just running clang-format on everything. (I didn’t eyeball everything, so other interesting cases might exist.)
Some .inc/.def files are not true C/C++ sources; libc especially has files with what look like special tokens for generating the “real” source. These should probably be renamed to “.in” files, or something. If we leave those aside, libc looks to be nearly 100% clean.
The only major problem with clang-format itself, that I noticed, is that it seems not to handle trailing return types very well. There appear to be some existing issues in this area. The flang
and libcxx
projects in particular seemed to fall victim to this.
There are of course other quirks of clang-format that I personally would find annoying, but I’m not sure they rise to the level of “bugs.”
What about non-C/C++ code?
The other major language that I know LLVM uses and clang-format
supports is TableGen. In my first pass of collecting data, I eyeballed some of the .td
file changes, and some of them were more peculiar than I’d like. Indentation was off, some things got mooshed together that shouldn’t have been, and so on. My guess is that TableGen support has seen a whole lot less love than C/C++ support.
So, because this is my RFC and I can make it say what I want to, I’m excluding .td
files.
Objections
This section is based on my memory of other discussions that related to this topic. Feel free to mention other objections.
Churn
When we first adopted clang-format as the pretty-printer of record, the project explicitly rejects bulk-formatting the project on the grounds that it would create a huge amount of churn, and disrupt areas that people were actively working on. This decision was consistent with the same decision made the last time we updated the Coding Standards.
I think the “churn” argument is much less valid than it was at the time. Most of the smaller, newer projects are close to 100% clean already. (Although I didn’t find any that are all the way there; this surprises me, as I had thought libcxx went through this exercise not so long ago. @ldionne ?) While we can see from the table above that clang and llvm are far from close, they are also not all that terrible.
- clang currently is only about 1/3 clean, in terms of file count; however, after formatting all the “small” and “medium” files, it will be about 75% clean.
- llvm is currently close to 50% clean; with “small” and “medium” files done, it will be over 80% clean.
The primary downside of formatting churn is, from what I can tell, that it “pollutes” the git history. git blame
becomes more tedious because you have to move past the mechanical NFC change(s) to get to the interesting bits. I hear there’s a git trick that specifies a set of commits to ignore when doing this, though, so perhaps a big-bang change (or more than one; we are talk a lot of files here, people) might not be so disruptive.
I have to say, when doing archaeology myself, I almost always have to step past multiple prior commits to get to the change I care about, so IMO this is not as big a problem as people make out. But maybe it bothers other people more than it bothers me.
Tool versioning
clang-format
is, like anything else in the project, subject to change. That means its decisions might vary over time, making the specification of “clang-format clean” less precise than might be desirable. There are a couple of ways to address this.
-
Live with it. The CI check might not 100% agree with whatever newer version you might have run on your patch, but my guess is the differences will be rare. There is after all a way for clang-format testing to verify that it has not regressed (see
clang/docs/tools/clang-formatted-files.txt
and the patch that introduced it (thanks to @asb for identifying this!). -
Commit a version to the repo, say once per release, and use that. This means that if something does change in a desirable way, the new version won’t be available until the next release occurs, which will certainly make somebody sad.
What’s Next?
I would like to see:
-
Agreement in principle that this is a good thing to do. I think shortly after a release branch has been cut is among the least terrible times to make a blanket change like this.
-
Decision about whether a big-bang formatting commit is appropriate, or whether we’d rather do things in a more granular way. Per-project big bangs would also be an option.
-
If we don’t big-bang, then agreement that code owners and other reviewers should encourage people to clang-format the files they’re working on first, before proceeding with the functional change. This is obviously much easier for people with commit access.
It would also be best, IMO, if this format-all-the-things process did not take an awfully long time; a month, or maybe two, say. In principle this has been going on for years already, and we’ve made so-so progress. Time to bite the bullet and get it done, I say!