How does clang-format parse snippets?

Hi,

I’m interested in using clang to refactor snippets of C++ for which I can’t produce an AST. AFAIK this precludes the use of clang tools like clang-check and I wondered if clang-format could be used instead as it doesn’t seem to require the production of an AST. I don’t quite understand how clang-format works and have a couple of questions:

clang-format is token-based: it sees the raw token stream, without even running the preprocessor, which is why it doesn’t need -I and -D flags. clang’s cc1 flag -dump-raw-tokens shows you what clang-format sees as input. Note how the #if 0 block gets printed instead of evaluated:

$ cat foo.cc
#if 0
asdf
#endif

int f();

$ bin/clang -c -Xclang -dump-raw-tokens foo.cc
hash '#' [StartOfLine] Loc=foo.cc:1:1
raw_identifier 'if' Loc=foo.cc:1:2
unknown ' ' Loc=foo.cc:1:4
numeric_constant '0' Loc=foo.cc:1:5
unknown '
' Loc=foo.cc:1:6
raw_identifier 'asdf' [StartOfLine] Loc=foo.cc:2:1
unknown '
' Loc=foo.cc:2:5
hash '#' [StartOfLine] Loc=foo.cc:3:1
raw_identifier 'endif' Loc=foo.cc:3:2
unknown '

' Loc=foo.cc:3:7
raw_identifier 'int' [StartOfLine] Loc=foo.cc:5:1
unknown ' ' Loc=foo.cc:5:4
raw_identifier 'f' Loc=foo.cc:5:5
l_paren '(' Loc=foo.cc:5:6
r_paren ')' Loc=foo.cc:5:7
semi ';' Loc=foo.cc:5:8
unknown '
' Loc=foo.cc:5:9

clang-format then has a bunch of heuristics to decide whether a * b is a multiplication or a declaration, but since it doesn’t build an AST, as you say, it doesn’t know whether “a” in two different places refers to the same variable. So in general it can’t be used for most automated refactorings, since you usually need ASTs for those.

(clang-format works great for formatting the output of an automated refactoring though.)

Thanks for this, it confirms what I expected from the behaviour of clang-format. I think that for our use-case we would need more than raw tokens (although I may look into doing it all with tokens only). I guess the next question I would have is:

  • Is there currently some way to sensibly produce partial ASTs?

I understand that clang uses a recursive descent parser and I was wondering if there is any sensible way to halt the descent – I’m not really sure if the grammar of C++ would even allow for such a thing to make sense. I’ve been looking online for some kind of document explaining how the clang parser actually works and the stages it goes through from source > token > AST; I’ve found it difficult to get an understanding by looking at the clang source code.

Thanks,

Stuart