Repeated clang-format'ting keeps changing code / use of clang-format in CI

Hi!

I'm trying to integrate clang-format version 9 with CI in a way that the
CI run only passes if applying clang-format yields the exact same
code, i.e. "git diff --exit-code" returns 0. The idea is that every
pull request would only pass if clang-format

When trying to put that approach to action with libexpat [1]
I noticed that multiple runs to clang-format do not seem to produce the
same code, at least not with the two version of Clang 9 that I tested
[2], and at least not with libexpat code.

I wonder if that's a known problem, if it's fixed in later versions
of clang-format, if you are aware of workarounds, or if I just need
to say goodbye to combining clang-format and CI for stable style checking.

Thanks in advance!

Best

Sebastian

[1] https://github.com/libexpat/libexpat/pull/293
[2] 9.0.0.9999 commit 28c954cf961a846789a972a8ed179b7108244ae7

Hello again,

a quick update for anyone who wondered the same and ran into my previous
e-mail:

For a workaround, troublesome code can be wrapped by…

  // clang-format off
  [..]
  // clang-format on

…for C++ or…

  /* clang-format off */
  [..]
  /* clang-format on */

…for C to disable re-formatting of that section of code. That way,
clang-format produces stable results and becomes suitable for during CI.

Best

Sebastian

In general, I would expect any instability to be a bug - no doubt there are bugs, but if you can reduce the example to something easy to communicate in a bug report, etc, and file it, I expect some folks would be interested in fixing it.

clang-cl is supposed to reach a fixpoint in one iteration. If it doesn’t, that’s a bug. If you can, please post a reduced repro case :slight_smile:

Hi Nico,

that's great to hear. I have attached a script to reproduce the issue
from public code of libexpat 2.2.7 as well as the second iteration diff
I get from clang-format version 10.0.0
(/var/tmp/portage/sys-devel/clang-10.0.0.9999/work/x/y/clang-10.0.0.9999
c0048be7ff340ebba3092e95d82147bc9928b909) of Gentoo.

Best

Sebastian

clang-format-fixpoint-issue-demo.sh.txt (428 Bytes)

expat-2-2-7-clang-format-second-round.diff (25 KB)

Hi, this is the smallest reproducer I could find with the default clang-format configuration.

$ clang-format --version
TRC clang-format version 8.0.0 (tags/RELEASE_800/final) (based on LLVM 8.0.0)
$ clang-format test.c > test.c.1 && clang-format test.c.1 > test.c.2 && diff test.c.1 test.c.2
1,4c1,5
< const char *expected = XCS(“ABCDEFGHIJKLMNOPABCDEFGHIJKLMNOPABCDEFGHIJKLMNOPAB”
< “C”) XCS(“ABCDEFGHIJKLMNOPABCDEFGHIJKLMNOPABCDEFGHIJ”
< “KLMNOPABC”) XCS(
< “ABCDEFGHIJKLMNOPABCDEFGHIJKLMNOPABCDEFGHIJKLMNOPABC”)

test.c (970 Bytes)

Thanks!

Here’s a (not minimal) reduction of the first issue; it looks like the other issues are basically the same thing:

$ cat fmt.c
void cdataSectionTok()
{
switch (1) {
#define LEAD_CASE(n)
case BT_LEAD ## n:
break
LEAD_CASE(2) LEAD_CASE(3) LEAD_CASE(4)
#undef LEAD_CASE
default:
break;
}
}

$ out/gn/bin/clang-format fmt.c
void cdataSectionTok() {
switch (1) {
#define LEAD_CASE(n)
case BT_LEAD##n:
break
LEAD_CASE(2)
LEAD_CASE(3) LEAD_CASE(4)
#undef LEAD_CASE
default : break;
}
}

As you can see by the weird “default : break” at the end, clang-format gets confused about the whole switch statement. It looks like the macros confuse it. Just adding semicolons after it is sufficient to unconfuse it:

$ cat fmt.c
void cdataSectionTok()
{
switch (1) {
#define LEAD_CASE(n)
case BT_LEAD ## n:
break
LEAD_CASE(2); LEAD_CASE(3); LEAD_CASE(4);
#undef LEAD_CASE
default:
break;
}
}

$ out/gn/bin/clang-format fmt.c
void cdataSectionTok() {
switch (1) {
#define LEAD_CASE(n)
case BT_LEAD##n:
break
LEAD_CASE(2);
LEAD_CASE(3);
LEAD_CASE(4);
#undef LEAD_CASE
default:
break;
}
}

If you’re able to change expat’s source, this might be a good approach.

Else, you can explicitly tell clang-format the name of macros that should be handled as statements in your .clang-format file like so:

StatementMacros: [‘LEAD_CASE’]

That helps with LEAD_CASE. (clang-format should still converge within one round, but it’s going to converge to something bad without either of the two things I suggest, so it wouldn’t help you much).

Like Raphaël’s mail says, XCS is slightly different because it’s not a statement. clang-format has special logic for formatting sequences of string literals, and that doesn’t fire if each literal is surrounded by a macro.

You can fix this by doing XCS("foo" "bar") (with a linebreak in between foo and bar) instead of XCS("foo") XCS("bar"). Semantically they do the same thing as far as I can tell, but clang-format does much better with the first form. Having done this transformation locally, this is arguably easier to read for humans too.

There isn’t a .clang-format setting for this case as far as I know.

(Again, clang-format should reach a fixed point in one hop, but again the one hop gives you pretty bad formatting so fixing that bug wouldn’t help you all that much.)

With these two changes, clang-format does reach an end state in one step, and its output (for the two cases I looked at) looks good too.

Hope this helps!

Hello Nico,

As you can see by the weird "default : break" at the end, clang-format
gets confused about the whole switch statement. It looks like the macros
confuse it. Just adding semicolons after it is sufficient to unconfuse it:

[..]
LEAD_CASE(2); LEAD_CASE(3); LEAD_CASE(4);
[..]

If you're able to change expat's source, this might be a good approach.

interesting idea!

Else, you can explicitly tell clang-format the name of macros that
should be handled as statements in your .clang-format file like so:

StatementMacros: ['LEAD_CASE']

Good to know!

You can fix this by doing `XCS("foo" "bar")` (with a linebreak in
between foo and bar) instead of `XCS("foo") XCS("bar")`. Semantically
they do the same thing as far as I can tell, but clang-format does much
better with the first form.

Things get interesting here because XCS is either

  # define XCS(s) _XCS(s)
  # define _XCS(s) L ## s

or

  # define XCS(s) s

depending of macro XML_UNICODE_WCHAR_T to turn string literals into
wide or narrow string literals. For the first version, XCS("foo" "bar")
results in invalid C syntax, mixing wide L"foo" with narrow "bar". As a
result, I needed to turn off BreakStringLiterals altogether for now.

Best

Sebastian

Hello!

I have found existing bug "Inconsistent output when running clang-format
twice" [1] now and attached the instructions to my original case to it.

Best

Sebastian

[1] https://bugs.llvm.org/show_bug.cgi?id=42509

Hello Nico,

As you can see by the weird “default : break” at the end, clang-format
gets confused about the whole switch statement. It looks like the macros
confuse it. Just adding semicolons after it is sufficient to unconfuse it:

[…]
LEAD_CASE(2); LEAD_CASE(3); LEAD_CASE(4);
[…]

If you’re able to change expat’s source, this might be a good approach.

interesting idea!

Else, you can explicitly tell clang-format the name of macros that
should be handled as statements in your .clang-format file like so:

StatementMacros: [‘LEAD_CASE’]

Good to know!

You can fix this by doing XCS("foo" "bar") (with a linebreak in
between foo and bar) instead of XCS("foo") XCS("bar"). Semantically
they do the same thing as far as I can tell, but clang-format does much
better with the first form.

Things get interesting here because XCS is either

define XCS(s) _XCS(s)

define _XCS(s) L ## s

or

define XCS(s) s

depending of macro XML_UNICODE_WCHAR_T to turn string literals into
wide or narrow string literals. For the first version, XCS(“foo” “bar”)
results in invalid C syntax, mixing wide L"foo" with narrow “bar”. As a
result, I needed to turn off BreakStringLiterals altogether for now.

FWIW, C++11 added “If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand.” to [lex.string]p13, so in C++11 L"foo" "bar" is the same as L"foo" L"bar". But granted, it’s implementation-defined in C++ before C++11.

The C standard says “If any of the tokens are wide string literal tokens, the resulting multibyte character sequence is treated as a wide string literal; otherwise, it is treated as a character string literal.” in 6.4.5 String Literals p4. C89 still said “If a character string literal token is adjacent to a wide string literal token, the behavior is undefined”, I think this changed in C99.

So if you need to support compilers that don’t implement the C99 behavior (which you likely do need to do :slight_smile: ), then you’re right.

We should probably have a bug for making clang-format handle string literals surrounded by a macro call without semicolons that aren’t in StatementMacros the same as bare string literals.