RFC: Extend clang-format to support more/all C-like languages

The context-free parser implemented in clang-format is capable of (or can easily be extended to) understand the structure of basically all C-like languages. A basic definition of C-like languages is the “Influenced” section of C’s Wikipedia page [1].

At most, this includes: AMPL, AWK, csh, C++, C--, C#, Objective-C, BitC, D, Go, Rust, Java, JavaScript, Limbo, LPC, Perl, PHP, Pike, Processing, Seed7, Verilog. Today, clang-format already supports C, C++, Objective-C and Objective-C++. Starting from there, it seems almost trivial to extend support to JavaScript and Java, which only contain a small number of additional syntactical constructs and can be tokenized with Clang’s lexer. Eventually, we also imagine (and would love to see patches for) formatting C#, D, Go, Rust, and PHP based on their similar syntax and active usage. Maybe there are others the community would be interested in seeing?

The benefit of using the same format tool for all these jobs is that many of clang-format’s advanced formatting algorithms (e.g. TeX-like analysis of the entire solution space) are immediately available to the other languages.

Concrete proposal:

Start with JavaScript. Syntactically, it is very close to C++ and there are already several efforts under way to combine JavaScript and LLVM. Add an additional LanguageStandard (this flag already supports C++03 and C++11) in clang-format’s configuration and gate JavaScript-specific formatting decisions (e.g. indentation of JavaScript’s namespace-equivalent) on it.

To make clang-format more useful to the LLVM project itself, support for the tblgen language seems another worthy goal that can be achieved in the same way as JavaScript.

Thoughts? Comments? Concerns with this direction?

If this is in line with LLVM’s/Clang’s roadmap, we’ll start working on the few features missing for formatting JavaScript and we should have rudimentary support towards the end of the year.

[1] http://en.wikipedia.org/wiki/C_(programming_language)

This sounds pretty interesting to me. The only concern I would have is that we don’t want Clang’s lexer to have to become a “Grand unified lexer for all languages”. The token rules of languages in these families are related, but different. Would it be better to have clang-format support using multiple different lexers?

-Chris

I think this is a great idea. clang-format is so useful and I’d love to be able to use it with other languages I use. I agree with Chris that lexing support for all these languages should be somehow separate from clang to prevent clang from becoming unmaintainably complex.

Daniel should give the definitive answer, but my feeling is that the
interesting commonality with all of these languages is that the lexing
rules (both token and comment) are more similar than different.

The Clang lexer is *really* nice, and I wouldn't want to see us ending up
with hand written lexers that largely duplicate the interesting logic of
Clang's.

I would suggest waiting to see. Essentially, if we try this out, and we end
up with a big or ugly patch to the lexer that doesn't very cleanly factor
out a minor difference for a particular language, then that'll be the sign
that we need to have a separate lexer and abstract across them in
clang-format. Until then, we re-use the existing lexer and make the minor
extensions needed.

What if the lexer was an overly large, non-abstract base class? Then, the derived classes can just override the tokens as needed.

(FWIW, I wouldn't try to design changes to the lexer in this thread, in the
abstract... If this is an interesting path to pursue, I suspect Daniel or
others should produce concrete proposed patches that enable the features
needed and minimize the pollution of the lexer with other languages...)

My gut feeling is that we won’t need to change the lexer. Bear in mind that (as with everything else in clang-format) we only need to understand the language well enough to format it. There might always be corner cases where we aren’t correct, but these are rare in practice.

In fact, I would like to go ahead and see whether we really hit the limit somewhere and if so, what the problems are. Once we have sufficient information, we can make a good decision on how to continue. Options would be allowing different lexers or post-processing the output of Clang’s lexer. I fully agree that we should not modify the lexer to accommodate other languages.

How do you intend to lex JavaScript's === and !== operators? Or the `var`
keyword? These aren't "corner cases" at all.

-- Sean Silva

What do you expect to be formatted differently because of those?

More specifically:

  • “var” is lexed just fine. It is the identifier lookup/parsing that you are thinking of, and we can easily do that.
  • Let ===/!== just be lexed as ==/!= and =. We can set the spacing/line break appropriately. As I mentioned, we don’t need to understand the language; we just need to format it correctly. Alternatively, do the post-processing I was describing and merge the two tokens received from the lexer.
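
The token-merging idea in the second bullet can be sketched roughly as follows. This is only an illustration in Python, not clang-format code: the `(text, start_offset)` token shape, the `JS_MERGES` table, and the function name are all assumptions made for the example.

```python
# Hypothetical post-processing pass over tokens produced by a C++ lexer.
# Each token is a (text, start_offset) pair; this shape is an assumption.

# Pairs of adjacent C++ tokens that form a single JavaScript operator.
JS_MERGES = {("==", "="): "===", ("!=", "="): "!=="}


def merge_js_operators(tokens):
    """Merge directly adjacent token pairs listed in JS_MERGES."""
    merged = []
    for text, start in tokens:
        if merged:
            prev_text, prev_start = merged[-1]
            # Only merge when there is no whitespace between the two tokens.
            adjacent = prev_start + len(prev_text) == start
            combined = JS_MERGES.get((prev_text, text))
            if adjacent and combined:
                merged[-1] = (combined, prev_start)
                continue
        merged.append((text, start))
    return merged


# Tokens a C++ lexer would plausibly produce for "a === b".
tokens = [("a", 0), ("==", 2), ("=", 4), ("b", 6)]
print(merge_js_operators(tokens))
```

The adjacency check keeps the pass from gluing together tokens that merely happen to follow each other with whitespace in between.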

While those don't seem like a big deal in terms of hacking around Clang's
lexer not being a javascript lexer, JavaScript's regex literals do seem
like they would be quite a bit of work. For example, you might have:

if ((m = /^(\s*)([a-zA-Z0-9._-]+)\s*=\s*/.exec(chunk))) {

As long as clang's lexer doesn't choke on stuff like this, you probably can
reconstruct it just fine with a bit of work. It just seems like a source of
a lot of complexity. As a side note, there are some edge cases where the
regex literal can contain something that clang will lex as a comment, e.g.
/[/*]/ or /[//]/, but I doubt those are very common and so not problematic.
Languages with "raw" string literals that aren't lexically identical to
C++'s will also present a similar problem (e.g. a URL in a raw string
literal will result in http:// interpreted as starting a comment).
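
To make the comment hazard concrete, here is a minimal Python sketch of a scanner that, like a C-family lexer, knows about double-quoted strings and comments but nothing about regex literals or foreign raw-string syntax. The function name and its simplifications are assumptions for illustration; Clang's real lexer is far more involved.

```python
def find_comment_start(line):
    """Return the index where a C-style scanner would start a comment, or -1.

    Only double-quoted string literals are skipped; regex literals and
    other languages' raw strings are deliberately not understood.
    """
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_string:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif line.startswith("//", i) or line.startswith("/*", i):
            return i
        i += 1
    return -1


# A regex literal whose character class contains "/*" is misread as a comment.
print(find_comment_start("var re = /[/*]/;"))
# A URL inside a backtick-delimited raw string is misread the same way.
print(find_comment_start("s = `http://x`;"))
# A URL inside an ordinary double-quoted string is handled correctly.
print(find_comment_start('s = "http://x";'))
```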

It just seems like using one language's lexer for another language is a big
hack, and very prone to running into a "showstopper" problem. There are
perfectly good javascript lexers around (e.g.
<http://esprima.org/demo/parse.html>, click on "Tokens"; check "Line and
column based" to get source location info and "include comments" for
comments). Could clang-format just have a component that accepts a stream
of tokens on stdin in some specified format and produces a list of
replacements?

-- Sean Silva

JavaScript's "namespace equivalent" is just an anonymous function, so I'm
not sure how you intend to detect this lexically.

I've been really wanting something like clang-format for tablegen. By happy
coincidence AFAIK the only situation where tablegen is lexically different
from C++ (excluding keyword differences) is a special form of string
literal that contains only C++ code, and the delimiters for the string
literal are `[{` and `}]` which will be lexed and can be recognized, and
then clang-format could recursively format the C++ code.
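
As a rough illustration of recognizing those delimiters, the following Python sketch locates the text between each `[{` and `}]` so that it could be handed to a C++ formatter. It assumes the first `}]` after an opener closes the literal; the function name and the sample input are made up for the example.

```python
def find_code_literals(source):
    """Return (start, end) offsets of the text between each `[{` and `}]`."""
    spans = []
    pos = 0
    while True:
        open_at = source.find("[{", pos)
        if open_at == -1:
            break
        # Assumption: the first `}]` after the opener ends the literal.
        close_at = source.find("}]", open_at + 2)
        if close_at == -1:
            break  # unterminated literal; leave it alone
        spans.append((open_at + 2, close_at))
        pos = close_at + 2
    return spans


src = "let Body = [{ return x; }];"
for start, end in find_code_literals(src):
    # Each slice is the embedded C++ text a formatter could recurse into.
    print(repr(src[start:end]))
```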

If preliminary experiments show that it is feasible to work with Clang's
lexer and lexically different languages, then I think this probably makes
sense. It may just be easier (not to mention more correct) to be able to
plug in different lexers though.

-- Sean Silva

JavaScript has some extremely different syntax that is likely to play havoc with a straight C/C++ lexer. It also depends on exactly which variant of JavaScript you want to support--ES5? ES6? Mozilla's JS extensions? Support E4X as well?
1. Regular expressions. I don't recall off the top of my head, but I believe it boils down to "/ starts a regular expression if you're expecting an operand and is a division operator if you're not"--you'll need to do at least enough parsing to distinguish those two cases.
2. Array comprehensions (ES 6/Mozilla JS 1.8.5 enhancements): [x for (x in obj)], [x for each (x in obj)], [x for (x of obj)]. The middle is not in ES 6 (it's actually a holdover from E4X that sticks around because it was introduced well before the for-of statement was, and found relatively widespread use in Mozilla which made the JS people keep it around when we killed E4X), and I don't recall if the generator form (without enclosing brackets) is in ES 6 or not.
3. Generators: function*() { yield y; yield* x; }. I don't even know recommended style guides for the star-variant, as I only just retrofitted my code to have them two or three days ago.
4. Object literals:
var x = {
   get y() { return z; },
   x: 13,
   q: function () { return this.x; }
};
5. Semicolon insertion. Newlines can become semicolons in the right circumstances (the worst misfeature in JS, IMHO).
6. Reading the ES6 draft, it supports \u in IdentifierNames just like Java does.
7. Some other operators are in play. === and !== have been brought up; there are also >>> and >>>= (like Java's operators), and ... (array spread operator).
8. Some ES6 features I haven't played with that may or may not have been added to some libraries yet: template strings, and there's also a =>-like notation for functions IIRC.

Regular expression literal support will definitely need different lexing paths than C/C++, although (excluding template strings and some E4X literals--the former of which is too new to be widely supported and the latter of which has already been ripped out from the only major engine that supported it) I think it is otherwise close enough to reuse a lot of the lexing capabilities of C-family languages. Just be forewarned that the shallow parsing that needs to be done for JS is likely to be rather different from that done for C/C++, even if their lexing streams look more or less similar.
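
The "expecting an operand" rule from item 1 could be approximated by looking at the previous significant token. The Python sketch below is a simplified heuristic, not the specification's grammar: the token set is hypothetical and incomplete, and cases such as a `/` following `)` are genuinely ambiguous without more parsing.

```python
# Hypothetical, incomplete set of previous tokens after which an operand is
# expected, so a following "/" would begin a regular expression literal.
# None stands for "start of input".
OPERAND_EXPECTED_AFTER = {
    None, "(", "[", "{", ",", ";", ":", "?", "!",
    "=", "==", "===", "!=", "!==", "&&", "||",
    "return", "typeof",
}


def slash_starts_regex(prev_token):
    """Guess whether "/" begins a regex literal, given the previous token."""
    return prev_token in OPERAND_EXPECTED_AFTER


# In "m = /^(\s*)/.exec(chunk)" the slash follows "=", so it is a regex.
print(slash_starts_regex("="))
# In "total / count" the slash follows an identifier, so it is division.
print(slash_starts_regex("total"))
```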

For reference, the entire syntax is split down the middle between "/ means
division" vs "/ means regex". Spec reference is the Annotated ES5;
the two goal symbols of the lexical grammar are InputElementDiv and
InputElementRegExp.

> 4. Object literals:
> var x = {
>   get y() { return z; },
>   x: 13,
>   q: function () { return this.x; }
> };

This is the kind of feature that I'm least concerned with, since it is
lexically identical to C++ (modulo a few keywords that will be interpreted
as identifiers).

> 5. Semicolon insertion. Newlines can become semicolons in the right
> circumstances (the worst misfeature in JS, IMHO).

This isn't a big deal. Basically there's a fixed set of lookahead tokens
and if the next token is one of them, then the statement continues,
otherwise it ends ("semicolon is inserted"). (Of course, open parentheses
override this but I'm sure clang-format already handles that case just
fine).
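
The lookahead idea described here can be sketched in Python as below. It models only the heuristic as stated in this message, not the specification's actual insertion rules; the `CONTINUATION_TOKENS` set is a hypothetical, incomplete example, and `open_parens` stands in for whatever nesting state a formatter already tracks.

```python
# Hypothetical, incomplete set of tokens that, when they begin the next
# line, are taken to continue the current statement.
CONTINUATION_TOKENS = {
    "(", "[", ".", ",", "?", ":",
    "+", "-", "*", "/", "=", "==", "===", "&&", "||",
}


def statement_ends_at_newline(next_token, open_parens=0):
    """Decide whether a newline ends the statement ("semicolon is inserted")."""
    if open_parens > 0:
        return False  # inside parentheses the statement always continues
    return next_token not in CONTINUATION_TOKENS


# "var a = 1" followed by a line starting with "var": the statement ends.
print(statement_ends_at_newline("var"))
# "var a = b" followed by a line starting with "+": the statement continues.
print(statement_ends_at_newline("+"))
```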

If this goes through, I would expect clang-format's lambda formatting to be
top-notch, since about half of all javascript code basically consists of
lambdas :)

-- Sean Silva

I think this is a great direction so long as any lexer changes required are small, relatively isolated, and don’t impact performance in our normal compilation path.

-Doug