[clang-format] Token Annotation limitations

I’ve recently been looking into a bug in clang-format that incorrectly treats * as a TT_PointerOrReference:

void f() { operator+(a, b *b); }

Actually, the misinterpretation of * and & as either BinaryOperator or PointerOrReference is a major source of bugs in clang-format. We don’t have semantic information, and we often look very myopically at adjacent tokens, where there just isn’t enough context to disambiguate.

So whilst we easily see b * b as (b x b),

this is actually just

tok::identifier->tok::star->tok::identifier

Which is indistinguishable from

MyClass *b

which I think we’d all read as (class ptr b). In both cases we have no additional information, other than perhaps scanning up and down the line to find something relevant.
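To make the ambiguity concrete, here is a minimal sketch. It is not clang-format code: Kind, lexKinds and sameKinds are made-up names, and the toy lexer only knows identifiers, * and "everything else". It simply shows that once spelling is thrown away, the two snippets reduce to the same sequence of token kinds.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified token kinds (the real lexer uses clang's tok:: kinds).
enum class Kind { Identifier, Star, Other };

// Reduce a snippet to its token kinds, discarding spelling and whitespace.
std::vector<Kind> lexKinds(const std::string &Src) {
  std::vector<Kind> Out;
  for (std::size_t I = 0; I < Src.size();) {
    unsigned char C = static_cast<unsigned char>(Src[I]);
    if (std::isspace(C)) {
      ++I;
    } else if (std::isalpha(C) || C == '_') {
      // Consume a whole identifier.
      while (I < Src.size() &&
             (std::isalnum(static_cast<unsigned char>(Src[I])) ||
              Src[I] == '_'))
        ++I;
      Out.push_back(Kind::Identifier);
    } else {
      Out.push_back(C == '*' ? Kind::Star : Kind::Other);
      ++I;
    }
  }
  return Out;
}

// True when two snippets cannot be told apart by token kind alone.
bool sameKinds(const std::string &A, const std::string &B) {
  return lexKinds(A) == lexKinds(B);
}
```

With this, sameKinds("b * b", "MyClass *b") holds, which is exactly the position the annotator is in.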

This literally drops out the bottom of determineStarAmpUsage() with a
return TT_PointerOrReference;
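
As a rough illustration of what "dropping out the bottom" means, here is a toy classifier. It is a stand-in built on assumptions, not the real determineStarAmpUsage(): TokKind, StarUse and classifyStar are hypothetical names, and the single literal check stands in for whatever local checks run before the default.

```cpp
// Hypothetical token kinds and results, for illustration only.
enum class TokKind { Identifier, NumericLiteral, Star, Other };
enum class StarUse { BinaryOperator, PointerOrReference };

// Toy local heuristic that only looks at the neighbours of a '*'.
StarUse classifyStar(TokKind Prev, TokKind Next) {
  // A numeric literal on either side can only be a multiplication operand.
  if (Prev == TokKind::NumericLiteral || Next == TokKind::NumericLiteral)
    return StarUse::BinaryOperator;
  // identifier * identifier: nothing local tells us more, so we fall
  // through to the default, just like the final return in the real code.
  return StarUse::PointerOrReference;
}
```

Both b * b and MyClass *b arrive here as (Identifier, Identifier), so both get the default answer.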

For the bug I’m looking at, given the following example, the * in the first case, f(), is annotated as a pointer when it is actually a multiply:

void f() { operator+(Bar, Foo *Foo); }

class A {
void operator+(Bar, Foo *Foo);
};

Effectively, these two uses of operator+ look the same. Looking at the token annotations, there is almost nothing that distinguishes between the two.

AnnotatedTokens(L=0):
M=0 C=0 T=Unknown S=1 F=0 B=0 BK=0 P=0 Name=void L=4 PPK=2 FakeLParens= FakeRParens=0 II=0x1baa688a118 Text=‘void’
M=0 C=1 T=FunctionDeclarationName S=1 F=0 B=0 BK=0 P=80 Name=identifier L=6 PPK=2 FakeLParens= FakeRParens=0 II=0x1baa68bd9c8 Text=‘f’
M=0 C=0 T=Unknown S=0 F=0 B=0 BK=0 P=23 Name=l_paren L=7 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text=‘(’
M=0 C=0 T=Unknown S=0 F=0 B=0 BK=0 P=140 Name=r_paren L=8 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text=‘)’
M=0 C=0 T=FunctionLBrace S=1 F=0 B=0 BK=1 P=23 Name=l_brace L=10 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text=‘{’

Seems reasonable to me. I also think we should add and keep more information in the unwrapped line parser.

Kind regards,
Björn.

Here is an example where I annotate a function declaration after the fact:

AnnotatedTokens(L=1):
M=0 C=0 T=FunctionDeclarationReturnType S=1 F=0 B=0 BK=0 P=0 Name=void L=4 PPK=2 FakeLParens= FakeRParens=0 II=0x1f9f89f7628 Text=‘void’
M=0 C=1 T=FunctionDeclarationName S=1 F=0 B=0 BK=0 P=80 Name=identifier L=8 PPK=2 FakeLParens= FakeRParens=0 II=0x1f9f89f5358 Text=‘Bar’
M=0 C=0 T=FunctionDeclarationLParan S=0 F=0 B=0 BK=0 P=23 Name=l_paren L=9 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text=‘(’
M=0 C=1 T=FunctionDeclarationParameterType S=0 F=0 B=0 BK=0 P=140 Name=identifier L=12 PPK=2 FakeLParens=1/ FakeRParens=0 II=0x1f9f89f5358 Text=‘Bar’
M=0 C=0 T=FunctionDeclarationParameterComma S=0 F=0 B=0 BK=0 P=41 Name=comma L=13 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text=‘,’
M=0 C=1 T=Unknown S=1 F=0 B=0 BK=0 P=41 Name=const L=19 PPK=2 FakeLParens= FakeRParens=0 II=0x1f9f89f70c8 Text=‘const’
M=0 C=0 T=FunctionDeclarationParameterType S=1 F=0 B=0 BK=0 P=43 Name=int L=23 PPK=2 FakeLParens= FakeRParens=0 II=0x1f9f89f7340 Text=‘int’
M=0 C=1 T=PointerOrReference S=1 F=0 B=0 BK=0 P=230 Name=star L=25 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text=‘*’
M=0 C=1 T=StartOfName S=0 F=0 B=0 BK=0 P=240 Name=identifier L=26 PPK=2 FakeLParens= FakeRParens=0 II=0x1f9f89f53e8 Text=‘b’
M=0 C=1 T=PointerOrReference S=1 F=0 B=0 BK=0 P=230 Name=star L=28 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text=‘*’
M=0 C=0 T=FunctionDeclarationParameterComma S=0 F=0 B=0 BK=0 P=54 Name=comma L=29 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text=‘,’
M=0 C=1 T=FunctionDeclarationParameterType S=1 F=0 B=0 BK=0 P=41 Name=int L=33 PPK=2 FakeLParens= FakeRParens=0 II=0x1f9f89f7340 Text=‘int’
M=0 C=1 T=StartOfName S=1 F=0 B=0 BK=0 P=240 Name=identifier L=35 PPK=2 FakeLParens= FakeRParens=1 II=0x1f9f89f5418 Text=‘c’
M=0 C=0 T=FunctionDeclarationRParan S=0 F=0 B=0 BK=0 P=43 Name=r_paren L=36 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text=‘)’
M=0 C=0 T=Unknown S=0 F=0 B=0 BK=0 P=23 Name=semi L=37 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text=‘;’

This doesn’t fix my exact issue, but if we could perform this kind of annotation across all code, it might help to break down some of the ambiguities.
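
A minimal sketch of that kind of after-the-fact pass, assuming the line has already been recognised as a simple declaration of the shape `ReturnType Name ( params ) ;`. annotateDecl is a hypothetical helper that works on token spellings only; the type names are copied from the dump above (including its LParan/RParan spelling), and none of this is the actual patch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical post-pass: label the tokens of a line that is already known
// to be a function declaration. Returns one type name per input token.
std::vector<std::string> annotateDecl(const std::vector<std::string> &Toks) {
  std::vector<std::string> Types(Toks.size(), "Unknown");
  bool InParams = false;        // between '(' and ')'
  bool ExpectParamType = false; // right after '(' or ','
  for (std::size_t I = 0; I < Toks.size(); ++I) {
    const std::string &T = Toks[I];
    if (I == 0) {
      Types[I] = "FunctionDeclarationReturnType";
    } else if (I == 1) {
      Types[I] = "FunctionDeclarationName";
    } else if (T == "(") {
      Types[I] = "FunctionDeclarationLParan";
      InParams = ExpectParamType = true;
    } else if (T == ")") {
      Types[I] = "FunctionDeclarationRParan";
      InParams = ExpectParamType = false;
    } else if (InParams && T == ",") {
      Types[I] = "FunctionDeclarationParameterComma";
      ExpectParamType = true;
    } else if (InParams && ExpectParamType && T != "const") {
      // First non-qualifier token of a parameter is taken to be its type.
      Types[I] = "FunctionDeclarationParameterType";
      ExpectParamType = false;
    }
    // Everything else (qualifiers, '*', parameter names, ';') is left as
    // "Unknown" for the existing annotator to decide.
  }
  return Types;
}
```

The point of the design is that once the parameter type is pinned down, a following * no longer has to be guessed from its two neighbours alone.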

MyDeveloperDay