[RFC] Semi-Automatic clang-format of files with low frequency

mydeveloperday · June 28, 2020, 3:30pm

(Copying from Discourse)

All

A couple of months ago I added the following page documentation Clang-Formatted-Status to track the status of “How Much” clang-formatted the

LLVM/Clang project is.

I’m a contributor to clang-format and would like to see LLVM 100% clang formatted so we can use LLVM as a massive test-suite for clang-format when we make changes.

In the last couple of months since we added this page the % has gone up by ~4% and this is likely in most part of either: a mention in LLVM-Weekly, the premerge checks or perhaps some recent clang-format efforts by individuals. This is fantastic and every directory that gets to 100% increase the directories that I can run against to check against.

However, it recently twigged to me that files that don’t change very often are never going to be 100% clang-formatted simply by virtue of clang-formatting all new changes.

So I 100% understand this kind of topic comes up from time to time and I understand that we don’t want to automatically clang-format the entire tree as this can disrupt peoples downstream forks, especially where they actively have code inflight.

But I wonder if we could have a general rule that said a [NFC] clang-format change could be made on ANY file that had NOT been changed in a 6/12 months period? I believe this process could be automated at least in a semi-automatic way. Once complete the pre-merge checks should maintain the current status.

This would drive the goal of completely clang-formatted source tree, without the disruption to current active areas.

Any thoughts?

MyDeveloperDay

dblaikie · June 29, 2020, 8:53pm

(Copying from Discourse)

All

A couple of months ago I added the following page documentation Clang-Formatted-Status to track the status of “How Much” clang-formatted the

LLVM/Clang project is.

I’m a contributor to clang-format and would like to see LLVM 100% clang formatted so we can use LLVM as a massive test-suite for clang-format when we make changes.

In the last couple of months since we added this page the % has gone up by ~4% and this is likely in most part of either: a mention in LLVM-Weekly, the premerge checks or perhaps some recent clang-format efforts by individuals. This is fantastic and every directory that gets to 100% increase the directories that I can run against to check against.

For that purpose, would it be possible to fork LLVM and clang-format
the whole thing & then leave that fork stagnant except when testing
changes to clang-format & migrating it to newer formatting? What's the
benefit of using a live project for this testing?

However, it recently twigged to me that files that don’t change very often are never going to be 100% clang-formatted simply by virtue of clang-formatting all new changes.

So I 100% understand this kind of topic comes up from time to time and I understand that we don’t want to automatically clang-format the entire tree as this can disrupt peoples downstream forks, especially where they actively have code inflight.

But I wonder if we could have a general rule that said a [NFC] clang-format change could be made on ANY file that had NOT been changed in a 6/12 months period? I believe this process could be automated at least in a semi-automatic way. Once complete the pre-merge checks should maintain the current status.

This would drive the goal of completely clang-formatted source tree, without the disruption to current active areas.

Any thoughts?

I don't think I'd exactly have a problem with this - but yeah, imagine
it might not be super popular either.

- Dave

jh7370 · June 30, 2020, 7:33am

In general, sounds okay to me, but one slight concern I have is that there are some areas of code I’ve seen which are deliberately unformatted for various reasons, quite often because the code looks much nicer in its current state or similar. This of course might just mean a clang-format bug fix/small behaviour adjustment etc is needed, or “do not format” markers needs adding. However, short of an audit of every file that might be affected by this, it’s hard to know when something might be affected undesirably, so I’d be marginally against it being an automated approach without some kind of manual reading of the changes involved.

James

Chris_Tetreault · June 30, 2020, 4:26pm

I’m in favor of this because, like you said, unless we do something like this, we’ll never get there. Frankly, I think the inactive period should be much shorter. I would even go so far as to say something like 1 month of inactivity. This codebase is under heavy development, so any code under active development will likely be touched in a 1 month period. 6-12 months will only catch the truly stable code, but any code that only gets minimal changes will never reach 100% coverage.

I would also like to see a threshold (maybe on a per-file or per-subfolder level of granularity) where once the coverage gets over the threshold, we just “rip the band-aid off” and clang-format everything. Suppose we set the threshold at 90%, and Foo.cpp is over 90% formatted, we just format the entire file and be done with it.

I’m sympathetic to the argument that these sorts of change make managing downstreams more difficult, I maintain one myself. However, I would argue that finishing the formatting as fast as possible will reduce downstream pain because after the codebase is completely formatted, it’ll just be done and there will be no more formatting churn. As it stands, we constantly have to deal with formatting changes because there’s always new ones.

Thanks,

Christopher Tetreault

nhaehnle · June 30, 2020, 5:17pm

I’m in favor of this because, like you said, unless we do something like this, we’ll never get there. Frankly, I think the inactive period should be much shorter. I would even go so far as to say something like 1 month of inactivity. This codebase is under heavy development, so any code under active development will likely be touched in a 1 month period. 6-12 months will only catch the truly stable code, but any code that only gets minimal changes will never reach 100% coverage.

Hard no on the shorter period. I have changes I'm working on for
several months at a time. 1 year is a minimum for this in my opinion.

I’m sympathetic to the argument that these sorts of change make managing downstreams more difficult, I maintain one myself. However, I would argue that finishing the formatting as fast as possible will reduce downstream pain because after the codebase is completely formatted, it’ll just be done and there will be no more formatting churn. As it stands, we constantly have to deal with formatting changes because there’s always new ones.

First of all, we don't currently have to deal with constant formatting
changes, at least not in the part of the code that's relevant to
AMDGPU. Why? Because we follow the (good!) rule of avoiding gratuitous
churn. This works _fine_.

Second, these kinds of things never work out that way. You think you
change everything once, but for one thing, tastes and fashions change.
For another, you probably wouldn't capture everything that everybody
wants captured. This thread talks about clang-format. Fine, but other
people talk about changing variable naming styles, and so on.

Code style just doesn't matter that much, but churn does. Don't you
folks have more pressing matters to work on? If you're bored in your
current job, consider applying with us

Cheers,
Nicolai

echristo · June 30, 2020, 5:27pm

I’m in favor of this because, like you said, unless we do something like this, we’ll never get there. Frankly, I think the inactive period should be much shorter. I would even go so far as to say something like 1 month of inactivity. This codebase is under heavy development, so any code under active development will likely be touched in a 1 month period. 6-12 months will only catch the truly stable code, but any code that only gets minimal changes will never reach 100% coverage.

Hard no on the shorter period. I have changes I’m working on for
several months at a time. 1 year is a minimum for this in my opinion.

I’m sympathetic to the argument that these sorts of change make managing downstreams more difficult, I maintain one myself. However, I would argue that finishing the formatting as fast as possible will reduce downstream pain because after the codebase is completely formatted, it’ll just be done and there will be no more formatting churn. As it stands, we constantly have to deal with formatting changes because there’s always new ones.

First of all, we don’t currently have to deal with constant formatting
changes, at least not in the part of the code that’s relevant to
AMDGPU. Why? Because we follow the (good!) rule of avoiding gratuitous
churn. This works fine.

Second, these kinds of things never work out that way. You think you
change everything once, but for one thing, tastes and fashions change.
For another, you probably wouldn’t capture everything that everybody
wants captured. This thread talks about clang-format. Fine, but other
people talk about changing variable naming styles, and so on.

Code style just doesn’t matter that much, but churn does. Don’t you
folks have more pressing matters to work on? If you’re bored in your
current job, consider applying with us

I don’t think this is a productive comment - it attacks the person and their contributions rather than addressing the matter at hand. Also there’s quite a lot of research out there that has things like uniform code formatting and variable naming actually help productivity of a project as a whole so there’s that as well.

Thanks.

-eric

arsenm · June 30, 2020, 5:55pm

My main issue with this would be that clang-format does things that I don’t believe are stated in the LLVM style guide and I also disagree with. There’s a whole set of cases where it makes unwelcome changes to put things that were on separate lines on a single line. Whenever I run clang-format, I typically git checkout -p to discard all of these.

For example, it likes to take member initializer lists and pack them into as few lines as possible. This is just asking for merge conflicts (e.g. AMDGPUSubtarget has a ton of fields in it, and out of tree code is constantly adding new fields for unreleased subtargets). It also mangles BuildMI calls, where I believe every .add() should be on its own line, and stringing this into BuildMI().addFoo().addBar() is way less readable.

I also believe it’s 4 space indent on line wraps differs from the stated 2 space indent level (and this also disagrees with emacs llvm-style)

-Matt

Finkel_Hal_J · June 30, 2020, 6:12pm

I’m a contributor to clang-format and would like to see LLVM 100% clang formatted so we can use LLVM as a massive test-suite for clang-format when we make changes.

My main issue with this would be that clang-format does things that I don’t believe are stated in the LLVM style guide and I also disagree with. There’s a whole set of cases where it makes unwelcome changes to put things that were on separate lines on a single line. Whenever I run clang-format, I typically git checkout -p to discard all of these.

For example, it likes to take member initializer lists and pack them into as few lines as possible. This is just asking for merge conflicts (e.g. AMDGPUSubtarget has a ton of fields in it, and out of tree code is constantly adding new fields for unreleased subtargets). It also mangles BuildMI calls, where I believe every .add() should be on its own line, and stringing this into BuildMI().addFoo().addBar() is way less readable.

I agree that this is problematic. There are a number of places where repeated syntactic structures (e.g., in enums, initializers, chained-operator invocation) are represented using regular line offsets and line breaks. This is valuable, allows for semantically-relevant grouping, and in my opinion, enhances readability. I've certainly caught bugs in code that I've written because they became obvious once I lined everything up. Can clang-format be taught to format things in this way? For particular, long sections of code, I can imagine using some don't-format-here markers, but you wouldn't want these around every BuildMI(...) call.

-Hal

mydeveloperday · June 30, 2020, 6:19pm

I 100% get that we might not like the decisions clang-format is making, but how does one overcome this when adding new code? The pre-merge checks enforce clang-formatting before commit and that’s a common review comment anyway for those who didn’t join the pre-merge checking group. I’m just wondering are we not all following the same guidelines?

Concerns of clang-format not being good enough are fair enough, but that’s the area I’m focusing my development efforts on (as that’s where I’ve been trying to make a small contribution). This effort was driven out of a need in my view to have an extensive test suite to validate new changes to clang-format don’t break the formatting of pre-formatted code. This isn’t that possible with LLVM at the moment because large parts are not formatted.

However checking a code base which is in high flux but one that maintains a 100% clang-format clean status, is a near perfect test-suite. Especially one that I assume will have unit tests for the latest C++ features.

I’m not bored and whilst I understand this feels somewhat periphery to the seemingly much more pressing task of developing the next best compiler, I think we have to remember that clang-format is extensively used. (In my view by many more people than those actually using clang). GitHub reports 32,000 YAML files with the BasedOnStyle: LLVM alone that is present in the .clang-format file

I realize it feels like unnecessary churn which is why I suggested this be in code which hasn’t been touched in some time (I’d be fine if that time is 1+ years), but more often than not this is quite small basic style issues, similar to the ones below.

MyDeveloperDay

void HandleTranslationUnit(ASTContext& context) override {

void HandleTranslationUnit(ASTContext &context) override {
if (!Instance.getLangOpts().DelayedTemplateParsing)
return;

std::set<FunctionDecl*> LateParsedDecls;

std::set<FunctionDecl *> LateParsedDecls;

Chris_Tetreault · June 30, 2020, 6:35pm

At my previous workplace, we had a codebase that was 100% completely formatted via clang-format. I had a coworker that hated the local style. However, since we had complete coverage, he was able to set up git hooks to run clang-format to reformat the codebase to his preferred style on checkout and back on commit. He was only able to do this because the coverage was complete.

This doesn’t help with the merge conflict issue, but it definitely helps with the “clang-format does this thing that I don’t like” issue. However, until we finish formatting the codebase, nobody can do this.

Finkel_Hal_J · June 30, 2020, 6:45pm

I 100% get that we might not like the decisions clang-format is making, but how does one overcome this when adding new code? The pre-merge checks enforce clang-formatting before commit and that's a common review comment anyway for those who didn't join the pre-merge checking group.

Those checks are advisory. I do not recall a case where thoughtfully-hand-formatted code was proposed, adhering to the coding guidelines, and a reviewer asked for the file to be run through clang-format. When the thoughtfully-formatted code does not match what clang-format would produce in such cases, it's generally obvious why. Normally, this request is made where the file is inconsistently formatted, or formatted in a way that obviously does not comply with our guidelines.

I've observed that there is a wide spectrum of developer preferences: some thoughtfully and deliberately format all of their code, others deal with formatting only when they feel like they absolutely must and are happy to have a tool that does anything consistent. Most people are probably somewhere in between.

I'm just wondering are we not all following the same guidelines?

We have specific guidelines that everyone should follow. clang-format implements a superset of those.

Here's a counter-proposal: You should hook up a tool that automatically opens Phabricator reviews to clang-format the source code. It should do this at a very low rate. It may take years to get through everything. But, humans will look at the formatting changes, and where clang-format should be improved instead of changing the code, we'll discover and classify those things. Formatted files can be noted, however, to incrementally build the clang-format test suite.

Concerns of clang-format not being good enough are fair enough, but that's the area I'm focusing my development efforts on (as that's where I've been trying to make a small contribution). This effort was driven out of a need in my view to have an extensive test suite to validate new changes to clang-format don't break the formatting of pre-formatted code. This isn't that possible with LLVM at the moment because large parts are not formatted.

However checking a code base which is in high flux but one that maintains a 100% clang-format clean status, is a near perfect test-suite. Especially one that I assume will have unit tests for the latest C++ features.

I'm not bored and whilst I understand this feels somewhat periphery to the seemingly much more pressing task of developing the next best compiler, I think we have to remember that clang-format is extensively used. (In my view by many more people than those actually using clang). GitHub reports 32,000 YAML files with the BasedOnStyle: LLVM alone that is present in the .clang-format file

Independent of everything else, this is a great success. Maybe we should have a website with a tracker on it or something.

Thanks again,

Hal

clattner · June 30, 2020, 8:47pm

(Copying from Discourse)

All

A couple of months ago I added the following page documentation Clang-Formatted-Status to track the status of “How Much” clang-formatted the

LLVM/Clang project is.

Wow, I hadn’t seen this - this is really cool!

I’m a contributor to clang-format and would like to see LLVM 100% clang formatted so we can use LLVM as a massive test-suite for clang-format when we make changes.

In the last couple of months since we added this page the % has gone up by ~4% and this is likely in most part of either: a mention in LLVM-Weekly, the premerge checks or perhaps some recent clang-format efforts by individuals. This is fantastic and every directory that gets to 100% increase the directories that I can run against to check against.

I’m a huge fan of clang-format, and have worked in “required to be formatted” code bases - it is a way better way to work in my opinion. That said, such a move that you’re talking about is something that will take time, because this is a social problem as well as a technical one.

Instead of starting from a “can we do everything” question (which you’re already getting some “no’s”), I’d start with an easier question of “is anyone ok with mechanically formatting any pieces of llvm?” There will be some folks (e.g. the MLIR ones, perhaps the flang ones?) that are more likely to be early adopters, and this gives the opportunity to figure out policies and tools to help maintain things. Then you can move on to other subdirectories within llvm/clang/etc, and progressively get there.

This helps achieve your goal of getting more of the stack, but avoids it being an “all or nothing” sort of dichotomy.

-Chris

nhaehnle · June 30, 2020, 8:56pm

I 100% get that we might not like the decisions clang-format is making, but how does one overcome this when adding new code? The pre-merge checks enforce clang-formatting before commit and that's a common review comment anyway for those who didn't join the pre-merge checking group. I'm just wondering are we not all following the same guidelines?

Concerns of clang-format not being good enough are fair enough, but that's the area I'm focusing my development efforts on (as that's where I've been trying to make a small contribution). This effort was driven out of a need in my view to have an extensive test suite to validate new changes to clang-format don't break the formatting of pre-formatted code. This isn't that possible with LLVM at the moment because large parts are not formatted.

However checking a code base which is in high flux but one that maintains a 100% clang-format clean status, is a near perfect test-suite. Especially one that I assume will have unit tests for the latest C++ features.

I'm not bored and whilst I understand this feels somewhat periphery to the seemingly much more pressing task of developing the next best compiler, I think we have to remember that clang-format is extensively used. (In my view by many more people than those actually using clang). GitHub reports 32,000 YAML files with the BasedOnStyle: LLVM alone that is present in the .clang-format file

My comment there was more targeted against those who would take your
idea and push it to unnecessary and counterproductive extremes. I do
appreciate that you are trying to solve an actual problem _for the
development of clang-format_

Since you're interested in thoughts about clang-format's capabilities,
I agree with Matt that the strictness of the approach is a weakness
that can frequently make code *less* readable. In addition to what he
mentions, here's an example of a bad change that clang-format wants to
make that I found after half a minute of scanning our AMDGPU backend
code:

   // We only support LOAD/STORE and vector manipulation ops for vectors
   // with > 4 elements.
- for (MVT VT : { MVT::v8i32, MVT::v8f32, MVT::v16i32, MVT::v16f32,
- MVT::v2i64, MVT::v2f64, MVT::v4i16, MVT::v4f16,
- MVT::v4i64, MVT::v4f64, MVT::v8i64, MVT::v8f64,
- MVT::v16i64, MVT::v16f64, MVT::v32i32, MVT::v32f32 }) {
+ for (MVT VT :
+ {MVT::v8i32, MVT::v8f32, MVT::v16i32, MVT::v16f32, MVT::v2i64,
+ MVT::v2f64, MVT::v4i16, MVT::v4f16, MVT::v4i64, MVT::v4f64, MVT::v8i64,
+ MVT::v8f64, MVT::v16i64, MVT::v16f64, MVT::v32i32, MVT::v32f32}) {
     for (unsigned Op = 0; Op < ISD::BUILTIN_OP_END; ++Op) {
       switch (Op) {

I don't particularly mind that clang-format puts that initializer list
onto a new line, or that it changes the whitespace around braces. What
I do mind: the original code lays the initializer list out carefully
such that integer and floating point types always come in pairs on the
same line: v8[if]32 and v16[if]32 on the first line, v2[if]64 and
v4[if]64 on the second line, and so on. clang-format mindlessly mushes
the initializer list together in a way that makes it harder to see at
a glance what is going on.

(Note that this isn't code that I wrote, so I'm not emotionally
attached to it or anything. But I've run into this kind of problem
many times during development.)

I believe the common theme here is that clang-format ought to have a
mode in which it is more accepting of different layouts of lists
co-existing within the same code base. If that feature was available,
I would be a strong proponent for adopting it in LLVM itself.

Cheers,
Nicolai

rupprecht · June 30, 2020, 9:17pm

For the vote count, I’m also in the group that we should format everything. Even if you aren’t making modifications to the code, it’s difficult to read code you’re trying to debug if it’s in a different style. Of course, these formatting changes should go through human review, and we should hold off on problematic places until we can fix clang-format to deal with it better (or use tricks like comments for alignment).

For the argument about downstream forks experiencing churn – is it enough to say that if your downstream fork hits a merge conflict, just back out of the merge and clang-format those files in your downstream fork, and your next attempt to rebase should usually work?

I think posting some clang-formatting patches, and looking at more “bad” examples might help. There are usually workarounds:

…

Since you’re interested in thoughts about clang-format’s capabilities,
I agree with Matt that the strictness of the approach is a weakness
that can frequently make code less readable. In addition to what he
mentions, here’s an example of a bad change that clang-format wants to
make that I found after half a minute of scanning our AMDGPU backend
code:

// We only support LOAD/STORE and vector manipulation ops for vectors
// with > 4 elements.

for (MVT VT : { MVT::v8i32, MVT::v8f32, MVT::v16i32, MVT::v16f32,

MVT::v2i64, MVT::v2f64, MVT::v4i16, MVT::v4f16,

MVT::v4i64, MVT::v4f64, MVT::v8i64, MVT::v8f64,

MVT::v16i64, MVT::v16f64, MVT::v32i32, MVT::v32f32 }) {

for (MVT VT :

{MVT::v8i32, MVT::v8f32, MVT::v16i32, MVT::v16f32, MVT::v2i64,

MVT::v2f64, MVT::v4i16, MVT::v4f16, MVT::v4i64, MVT::v4f64, MVT::v8i64,

MVT::v8f64, MVT::v16i64, MVT::v16f64, MVT::v32i32, MVT::v32f32}) {
for (unsigned Op = 0; Op < ISD::BUILTIN_OP_END; ++Op) {
switch (Op) {

I don’t particularly mind that clang-format puts that initializer list
onto a new line, or that it changes the whitespace around braces. What
I do mind: the original code lays the initializer list out carefully
such that integer and floating point types always come in pairs on the
same line: v8[if]32 and v16[if]32 on the first line, v2[if]64 and
v4[if]64 on the second line, and so on. clang-format mindlessly mushes
the initializer list together in a way that makes it harder to see at
a glance what is going on.

A common strategy for this type of pattern is to use trailing comments for alignment:
for (MVT VT : {
MVT::v8i32, MVT::v8f32, MVT::v16i32, MVT::v16f32, //
MVT::v2i64, MVT::v2f64, MVT::v4i16, MVT::v4f16, //
MVT::v4i64, MVT::v4f64, MVT::v8i64, MVT::v8f64, //
MVT::v16i64, MVT::v16f64, MVT::v32i32, MVT::v32f32 //
}) {

Live example: https://godbolt.org/z/PehFcC

(( Of course, even better, you can add text to those comments, like “// v8, v16”. But it’s a common enough pattern to use empty alignment comments. ))

(Note that this isn’t code that I wrote, so I’m not emotionally
attached to it or anything. But I’ve run into this kind of problem
many times during development.)

I believe the common theme here is that clang-format ought to have a
mode in which it is more accepting of different layouts of lists
co-existing within the same code base. If that feature was available,
I would be a strong proponent for adopting it in LLVM itself.

Does “// clang-format [ on | off]” suffice as an escape hatch?

Chris_Tetreault · June 30, 2020, 10:03pm

My comment there was more targeted against those who would take your idea and push it to unnecessary and counterproductive extremes.

There's nothing counterproductive about just formatting everything and being done with it. The goal of doing clang-format-diff as we commit code is to minimize churn so that there isn't one big patch that conflicts with everything. But if you only format the lines that get touched by a commit, the work will never be done. At some point you need to finish the work, or the tool might as well be called clang-make-the-code-inconsistent.

Since you're interested in thoughts about clang-format's capabilities, I agree with Matt that the strictness of the approach is a weakness that can frequently make code *less* readable. In addition to what he mentions, here's an example of a bad change that clang-format wants to make that I found after half a minute of scanning our AMDGPU backend
code:

  // We only support LOAD/STORE and vector manipulation ops for vectors
  // with > 4 elements.
- for (MVT VT : { MVT::v8i32, MVT::v8f32, MVT::v16i32, MVT::v16f32,
- MVT::v2i64, MVT::v2f64, MVT::v4i16, MVT::v4f16,
- MVT::v4i64, MVT::v4f64, MVT::v8i64, MVT::v8f64,
- MVT::v16i64, MVT::v16f64, MVT::v32i32, MVT::v32f32 }) {
+ for (MVT VT :
+ {MVT::v8i32, MVT::v8f32, MVT::v16i32, MVT::v16f32, MVT::v2i64,
+ MVT::v2f64, MVT::v4i16, MVT::v4f16, MVT::v4i64, MVT::v4f64, MVT::v8i64,
+ MVT::v8f64, MVT::v16i64, MVT::v16f64, MVT::v32i32,
+ MVT::v32f32}) {
    for (unsigned Op = 0; Op < ISD::BUILTIN_OP_END; ++Op) {
      switch (Op) {

I don't particularly mind that clang-format puts that initializer list onto a new line, or that it changes the whitespace around braces. What I do mind: the original code lays the initializer list out carefully such that integer and floating point types always come in pairs on the same line: v8[if]32 and v16[if]32 on the first line, v2[if]64 and
v4[if]64 on the second line, and so on. clang-format mindlessly mushes the initializer list together in a way that makes it harder to see at a glance what is going on.

(Note that this isn't code that I wrote, so I'm not emotionally attached to it or anything. But I've run into this kind of problem many times during development.)

I believe the common theme here is that clang-format ought to have a mode in which it is more accepting of different layouts of lists co-existing within the same code base. If that feature was available, I would be a strong proponent for adopting it in LLVM itself.

Cheers,
Nicolai

The problem is that while you may find the way it's formatted to be easier to read, not everybody does. If I were formatting it manually, I probably would have had 8 lines of pairs, with one pair per line, or I would have had all the integer types then all the float types if the order is not important. Now we have a disagreement, and a needless code review issue in the future when somebody adds a new MVT that needs to be handled and fails to figure out the pattern. That's the point of the tool. It can't be reasoned with. It can't be bargained with. It doesn't feel pity of remorse or fear. And it absolutely will not stop. Ever. Until your code is formatted correctly.

Arguing about code format in code review is a productivity drain. Time spent figuring out the style of this one random file is a productivity drain. Time spent inserting a bunch of spaces in a 43 line switch because you added a new case that is just 1 character too long is a productivity drain. We have a tool that can do this for us. Maybe it doesn't format everything exactly how we'd like, but at least you won't have to justify it in code review. Sure, maybe it will miss things, but at least you don't have to push a new patch because a line is too long or you put the * next to the type rather than the variable name. The tool will evolve and improve. Frankly, I'm surprised that it doesn't already perfectly implement the LLVM coding style, but surely this is the goal.

From your previous email:

Hard no on the shorter period. I have changes I'm working on for several months at a time. 1 year is a minimum for this in my opinion.

Your patch should already be formatted using clang-format-diff, right? Phabricator is configured to complain on your commit if this is not the case, so I don't see how a clang-format commit could possibly cause a conflict for you.

Timothy_Keith · June 30, 2020, 11:54pm

The flang sub-project is supposed to be fully formatted so it wouldn’t be hard to get to 100% for a test like this and we would be happy to participate. We were at 100% at one point but there are about 10 files that need formatting now, in part because of changes to clang-format. It would be good to get early warning of clang-format changes like that.

Let me know if there is anything I can do to help.

Tim

Zachary_Turner2 · July 1, 2020, 1:36am

Note that this is solvable. Put a comma at the end of your initializer list, always. This works for enums, array initializers, and AFAIK just arbitrary cases of comma separated things between curly braces. I personally think that it’s absolutely ridiculous that this “hack” exists and that there is no proper clang-format style option, but then again… I haven’t proposed or contributed a fix, so I’m not exactly helping.

Finkel_Hal_J · July 1, 2020, 2:03am

My comment there was more targeted against those who would take your idea and push it to unnecessary and counterproductive extremes.

There's nothing counterproductive about just formatting everything and being done with it. The goal of doing clang-format-diff as we commit code is to minimize churn so that there isn't one big patch that conflicts with everything. But if you only format the lines that get touched by a commit, the work will never be done. At some point you need to finish the work, or the tool might as well be called clang-make-the-code-inconsistent.

I agree. But I think that we should view this as an iterative process. We should generate patches for review, and realize that we might need to update clang-format in some cases instead of just always updating the code. clang-format has evolved over time, and I doubt that evolution has stopped now.

Since you're interested in thoughts about clang-format's capabilities, I agree with Matt that the strictness of the approach is a weakness that can frequently make code *less* readable. In addition to what he mentions, here's an example of a bad change that clang-format wants to make that I found after half a minute of scanning our AMDGPU backend
code:

// We only support LOAD/STORE and vector manipulation ops for vectors
// with > 4 elements.
- for (MVT VT : { MVT::v8i32, MVT::v8f32, MVT::v16i32, MVT::v16f32,
- MVT::v2i64, MVT::v2f64, MVT::v4i16, MVT::v4f16,
- MVT::v4i64, MVT::v4f64, MVT::v8i64, MVT::v8f64,
- MVT::v16i64, MVT::v16f64, MVT::v32i32, MVT::v32f32 }) {
+ for (MVT VT :
+ {MVT::v8i32, MVT::v8f32, MVT::v16i32, MVT::v16f32, MVT::v2i64,
+ MVT::v2f64, MVT::v4i16, MVT::v4f16, MVT::v4i64, MVT::v4f64, MVT::v8i64,
+ MVT::v8f64, MVT::v16i64, MVT::v16f64, MVT::v32i32,
+ MVT::v32f32}) {
for (unsigned Op = 0; Op < ISD::BUILTIN_OP_END; ++Op) {
switch (Op) {

I don't particularly mind that clang-format puts that initializer list onto a new line, or that it changes the whitespace around braces. What I do mind: the original code lays the initializer list out carefully such that integer and floating point types always come in pairs on the same line: v8[if]32 and v16[if]32 on the first line, v2[if]64 and
v4[if]64 on the second line, and so on. clang-format mindlessly mushes the initializer list together in a way that makes it harder to see at a glance what is going on.

(Note that this isn't code that I wrote, so I'm not emotionally attached to it or anything. But I've run into this kind of problem many times during development.)

I believe the common theme here is that clang-format ought to have a mode in which it is more accepting of different layouts of lists co-existing within the same code base. If that feature was available, I would be a strong proponent for adopting it in LLVM itself.

Cheers,
Nicolai

The problem is that while you may find the way it's formatted to be easier to read, not everybody does. If I were formatting it manually, I probably would have had 8 lines of pairs, with one pair per line, or I would have had all the integer types then all the float types if the order is not important. Now we have a disagreement, and a needless code review issue in the future when somebody adds a new MVT that needs to be handled and fails to figure out the pattern. That's the point of the tool. It can't be reasoned with. It can't be bargained with. It doesn't feel pity of remorse or fear. And it absolutely will not stop. Ever. Until your code is formatted correctly.

Okay, wait. Let's not solve problems that we don't have. I don't ever recall such a debate over formatting in a review thread (and while I don't any longer, for years, I read every review on the mailing list). I agree that your formatting suggestion is better than the current one, but if both the current formatting and your suggested improvement are better than what clang-format provides, so using the clang-format version seems like a regression. We should improve things so that using clang-format would be an improvement, or we should figure out how to exempt this part of the code.

Arguing about code format in code review is a productivity drain. Time spent figuring out the style of this one random file is a productivity drain. Time spent inserting a bunch of spaces in a 43 line switch because you added a new case that is just 1 character too long is a productivity drain. We have a tool that can do this for us. Maybe it doesn't format everything exactly how we'd like, but at least you won't have to justify it in code review. Sure, maybe it will miss things, but at least you don't have to push a new patch because a line is too long or you put the * next to the type rather than the variable name. The tool will evolve and improve. Frankly, I'm surprised that it doesn't already perfectly implement the LLVM coding style, but surely this is the goal.

No one I know likes spending their time formatting code. We do it when we feel it is important. Code is read more than it is written, and optimizing to make the code easier to read is very important. Formatting lists of things, etc. nicely has helped me catch bugs in the past, and even if the price for that is having to comment in a review thread about the formatting, or spend some time formatting, I'll take it. When things are nicely arranged, it's sometimes more obvious when something just isn't right. Tracking down bugs in compilers takes far more time than all of that.

Thanks again,

Hal

mehdi_amini · July 1, 2020, 3:36am

I 100% get that we might not like the decisions clang-format is making, but how does one overcome this when adding new code? The pre-merge checks enforce clang-formatting before commit and that’s a common review comment anyway for those who didn’t join the pre-merge checking group.

Those checks are advisory. I do not recall a case where thoughtfully-hand-formatted code was proposed, adhering to the coding guidelines, and a reviewer asked for the file to be run through clang-format.

In general I’d ask for the magic comment that disables clang-format on the particular section though, which keeps the file clean with respect to clang-format.

When the thoughtfully-formatted code does not match what clang-format would produce in such cases, it’s generally obvious why. Normally, this request is made where the file is inconsistently formatted, or formatted in a way that obviously does not comply with our guidelines.

I’ve observed that there is a wide spectrum of developer preferences: some thoughtfully and deliberately format all of their code, others deal with formatting only when they feel like they absolutely must and are happy to have a tool that does anything consistent. Most people are probably somewhere in between.

I’m just wondering are we not all following the same guidelines?

We have specific guidelines that everyone should follow. clang-format implements a superset of those.

Here’s a counter-proposal: You should hook up a tool that automatically opens Phabricator reviews to clang-format the source code. It should do this at a very low rate. It may take years to get through everything. But, humans will look at the formatting changes, and where clang-format should be improved instead of changing the code, we’ll discover and classify those things. Formatted files can be noted, however, to incrementally build the clang-format test suite

That seems like a great idea actually!

mydeveloperday · July 1, 2020, 10:45am

@Timothy Keith,

Through this I’ve become aware of several sub-projects/libs that are either 100% clean or near 100% clean (flang,milr)

I have my own internal 5 million line code base which is 100% clang-formatted (including all C# and JS, unit tests), plus Firefox is also I think 100% converted, some helpful devs over there pull some patches and check for regressions if I add them as reviewers.

I always knew “polly” was mostly clean too, and actually one of the areas I already test against, (along with lib/Format obviously).

If we could get whole sub-projects 100% clean then it would be possible for me to notify them downstream when a clang-format change (for a bug fix) requires an update. But without that 100% status I’m always playing a game of checking every change, once we are at sub-project at 100% any change in formatting can be identified as a potential bug.

The current process I’m building is for the script that generates that html page to also generate a list of directories that are 100% clean, then use that as input to a test which validates clang-format, any clang-format replacements signify either recent unformatted changes or potentially bug fixes or regressions. (all of which in my view are worth highlighting to either clang-format or the owners of the subprojects)

Where sub teams can help is by making their projects 100% clean and helping to enforce/maintain that. We really try to not change LLVM default style but I am trying to work my way through the bugs and that from time to time does show differences. ideally I would run this regression on your sub-projects and then @mention your Phabricator project that inside review. (or send a review of the changes)

This is not about imposing style on anyone, It is purely about ensuring we don’t break the world as fixing bugs inevitably causes some instability. However one wrong move could have massive ripples through all the projects that use clang-format, the unit tests are excellent but regressions still slip through the net, and we already had a case where v10 had an issue which impacted others after it had been branched. I feel we need to ensure clang-format is run over millions of lines of code to catch bugs as early as possible. (and to root out new ones), it will be of huge benefit to clang-format if our own code base is heavily formatted so I really appreciate those who consider this worthwhile.

MyDeveloperDay

Topic		Replies	Views
[RFC] Semi-Automatic clang-format of files with low frequency LLVM Project	1	381	June 27, 2020
[RFC] Semi-Automatic clang-format of files with low frequency LLVM Dev List Archives	1	96	June 30, 2020
[RFC] Semi-Automatic clang-format of files with low frequency LLVM Dev List Archives	0	100	July 1, 2020
Keeping LLVM Clang-Format Clean Project Infrastructure	4	328	January 27, 2022
RFC: clang-format all the things! LLVM Project	51	1610	February 22, 2024

[RFC] Semi-Automatic clang-format of files with low frequency

Related topics