Design: clang-format

Hi,

we’re working on the design of clang-format; there are quite a few open questions, but I’d rather get some early feedback to see whether we’re completely off track somewhere.

The main doc is on Google Docs:
https://docs.google.com/document/d/1gpckL2U_6QuU9YW2L1ABsc4Fcogn5UngKk7fE5dDOoA/edit

For those of you who prefer good old email, here is a copy of the current state. Feel free to add comments in either.
# Design: clang-format

This document contains a design proposal for a clang-format tool, which allows C++ developers to automatically format their code independently of their development environment.

## Context

While many other languages have auto-formatters available, C++ is still lacking a tool that fits the needs of the majority of C++ programmers. Note that when we talk about formatting as part of this document, we mean both the problem of indentation (which has been largely solved independently by regexp-based implementations in editors / IDEs) and line breaking, which proves to be a harder problem.

There are multiple challenges to formatting C++ code:

  • a vast number of different coding styles has evolved over time
  • many projects value consistency over conformance and dislike style-only changes, thus making it important to be able to work with code that is not written according to the most current style guide
  • macros need to be handled properly
  • it should be possible to format code that is not yet syntactically correct

## Goals

  • Format a whole file according to a configuration
  • Format a part of a file according to a configuration
  • Format a part of a file while staying as consistent as possible with the rest of the file, falling back to a configuration for options that cannot be deduced from the current file
  • Integrate with editors so that you can just type away until you’re far past the column limit, then hit a key and have the editor lay out the code for you, including placing the right line breaks

## Non-goals

  • Indenting code while you type; this is a much simpler problem, but has even stronger performance requirements. The current editors should be good enough, and we’ll allow new workflows that don’t ever require the user to break lines
  • The only lexical elements clang-format should touch are whitespace, string literals and comments. Any other changes, ranging from ordering includes to removing superfluous parentheses, are not in the scope of this tool.
  • Per-file configuration: be able to annotate a file with a style which it adheres to (?)

## Code location

Clang-format is a very basic tool, so it might warrant living in clang mainstream. On the other hand it would also fit nicely with other clang refactoring tools. TODO: Where do we want clang-format to live?

## Parsing approach

The key consideration is whether clang-format can be based purely on a lexer, or whether it needs type information, in which case we need the full AST.

We believe that we will need the full AST information to correctly indent code, break lines, and fix whitespace within a line.

Examples:

AST-dependent indentation:

    callFunction(foo<something,
                     ^ line up here, if foo is a template name
                 ^ line up here otherwise

AST-dependent line breaking:
Detecting that ‘*’ is a binary operator in this case requires parsing; if it is a binary operator, we want to line-break after it; if it is a unary operator, we want to prevent line breaking.

result = variable1 * variable2;

AST-dependent whitespace inside lines:

    a * b;
      ^ Binary operator or pointer declaration?
    a & f();
      ^ Binary operator or function declaration?
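
To make the ambiguity concrete, here is a small sketch in which identical token sequences mean different things depending on what the names refer to. All names (`as_declaration`, `as_expression`, `as_template`, `as_comparison`, `count`) are made up for illustration and are not part of the proposal.

```cpp
// Case 1: `a * b` is a pointer declaration when `a` names a type.
namespace as_declaration {
struct a {};
inline bool isNull() {
  a * b = nullptr;  // declares `b` as a pointer to `a`
  return b == nullptr;
}
}  // namespace as_declaration

// Case 2: `a * b` is a multiplication when `a` names a variable.
namespace as_expression {
inline int product() {
  int a = 6, b = 7;
  int result = 0;
  result = a * b;  // binary operator
  return result;
}
}  // namespace as_expression

// Case 3: `foo<something, other>(3)` is one call to a function template.
namespace as_template {
template <int X, int Y>
int foo(int v) { return v + X + Y; }
inline int call() {
  constexpr int something = 1, other = 2;
  return foo<something, other>(3);
}
}  // namespace as_template

// Case 4: the same tokens are two comparisons when `foo` is a variable,
// so `count` receives two separate arguments.
namespace as_comparison {
inline int count(bool, bool) { return 2; }
inline int call() {
  int foo = 0, something = 1, other = 2;
  return count(foo<something, other>(3));
}
}  // namespace as_comparison
```

A lexer sees the same tokens in each pair; only name lookup tells the cases apart, which is why the desired spacing and alignment differ.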

Challenge: Preprocessor
Not every line in a program is covered by the AST - for example, there are unused macro definitions, various preprocessor directives, #ifdef’ed out code, etc.

We will at least need some form of lexing approach for the parts of a source file that cannot be correctly indented / line broken by looking at the AST.
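
As an illustration of code the AST never sees, consider the following sketch. The macro names `HYPOTHETICAL_FEATURE` and `UNUSED_MACRO` are assumptions made up for this example; with `HYPOTHETICAL_FEATURE` undefined, the first branch is skipped by the preprocessor and produces no AST nodes, yet a formatter would still be expected to lay it out.

```cpp
#ifdef HYPOTHETICAL_FEATURE
// Skipped when HYPOTHETICAL_FEATURE is undefined: the parser never sees
// this branch, so its odd spacing cannot be fixed from the AST alone.
inline int   answer( ) { return 41 ; }
#else
inline int answer() { return 42; }
#endif

// Defined but never expanded, so no AST node refers to its body.
#define UNUSED_MACRO(x)   ((x)+1)
```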

## Algorithm

Visit all nodes on the AST; for each node that is part of a macro expansion, consider all locations taking part in that macro expansion. If the location is within the range that needs to be indented, look at the code at the location and the rules around the node, and adjust whitespace as necessary. If the node starts a line, adjust the indent; if a node overflows the line, break the line. TODO: figure out what to do with the lines that are not visited that way.
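
The "if a node overflows the line, break the line" step can be sketched on a flat token list. This is only a toy stand-in: plain strings replace AST nodes, the function name `breakLines` and its parameters are assumptions for illustration, and nothing here handles macros or indentation rules.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Greedily packs tokens onto lines, separated by single spaces, and starts
// a new, indented line before the first token that would pass the limit.
std::vector<std::string> breakLines(const std::vector<std::string>& tokens,
                                    std::size_t columnLimit,
                                    std::size_t continuationIndent) {
  std::vector<std::string> lines;
  std::string current;
  bool lineHasToken = false;
  for (const std::string& token : tokens) {
    // The +1 accounts for the separating space.
    if (lineHasToken && current.size() + 1 + token.size() > columnLimit) {
      lines.push_back(current);
      current.assign(continuationIndent, ' ');
      lineHasToken = false;
    }
    if (lineHasToken) current += ' ';
    current += token;
    lineHasToken = true;
  }
  if (lineHasToken) lines.push_back(current);
  return lines;
}
```

With the tokens of `result = variable1 * variable2;`, a limit of 20 columns and a continuation indent of 4, this breaks after the `*`, which matches the behaviour the line-breaking example asks for when `*` is a binary operator. A real implementation would need the AST to know that the break is allowed there.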
## Configuration

To support a majority of developers, being able to configure the desired style is key. We propose using a YAML configuration file, as there’s already a YAML parser readily available in LLVM. Proposals for more specific ideas welcome.
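
To give the discussion something concrete to shoot at, here is a minimal sketch of what a style configuration could look like in code. The option names (`ColumnLimit`, `IndentWidth`, `PointerBindsToType`) and the functions `trim` and `parseStyle` are assumptions invented for this example; the parser only understands flat `Key: value` lines, which is a small subset of YAML, and does not use LLVM's YAML parser.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical options; the proposal does not define any concrete ones yet.
struct FormatStyle {
  unsigned ColumnLimit = 80;
  unsigned IndentWidth = 2;
  bool PointerBindsToType = false;
};

// Removes leading and trailing spaces, tabs and carriage returns.
static std::string trim(const std::string& s) {
  const char* blanks = " \t\r";
  std::size_t first = s.find_first_not_of(blanks);
  if (first == std::string::npos) return "";
  std::size_t last = s.find_last_not_of(blanks);
  return s.substr(first, last - first + 1);
}

// Reads "Key: value" lines; unknown keys and lines without a colon are
// ignored, and options that are not mentioned keep their defaults.
// Note: std::stoul throws on non-numeric values; a real tool would report
// a proper diagnostic instead.
FormatStyle parseStyle(const std::string& text) {
  FormatStyle style;
  std::istringstream in(text);
  std::string line;
  while (std::getline(in, line)) {
    std::size_t colon = line.find(':');
    if (colon == std::string::npos) continue;
    std::string key = trim(line.substr(0, colon));
    std::string value = trim(line.substr(colon + 1));
    if (value.empty()) continue;
    if (key == "ColumnLimit") {
      style.ColumnLimit = static_cast<unsigned>(std::stoul(value));
    } else if (key == "IndentWidth") {
      style.IndentWidth = static_cast<unsigned>(std::stoul(value));
    } else if (key == "PointerBindsToType") {
      style.PointerBindsToType = (value == "true");
    }
  }
  return style;
}
```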
## Style deduction

When changing the format of code that does not conform to a given style configuration, we will optionally try to deduce style options from the file first, and fall back to the configured layout when there was no clear style deducible from the context.
TODO: Detailed design ideas.
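
As one possible starting point, the "deduce, else fall back" idea can be sketched for a single option. The function name `deduceIndentWidth` and the voting heuristic are assumptions for illustration only: it counts how far the leading-space indentation grows between consecutive non-blank lines and picks the most common step, falling back to the configured value when there is no evidence or a tie.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

unsigned deduceIndentWidth(const std::string& source,
                           unsigned configuredWidth) {
  std::map<unsigned, unsigned> votes;  // indent step -> number of sightings
  std::istringstream in(source);
  std::string line;
  std::size_t previous = 0;
  while (std::getline(in, line)) {
    std::size_t indent = line.find_first_not_of(' ');
    if (indent == std::string::npos) continue;  // blank line: no evidence
    if (indent > previous) ++votes[static_cast<unsigned>(indent - previous)];
    previous = indent;
  }
  unsigned best = configuredWidth;
  unsigned bestVotes = 0;
  bool tie = false;
  for (std::map<unsigned, unsigned>::const_iterator it = votes.begin();
       it != votes.end(); ++it) {
    if (it->second > bestVotes) {
      best = it->first;
      bestVotes = it->second;
      tie = false;
    } else if (it->second == bestVotes) {
      tie = true;
    }
  }
  // No clear style deducible from the context: use the configuration.
  return (bestVotes == 0 || tie) ? configuredWidth : best;
}
```

The same shape (collect evidence per option, require a clear winner, otherwise use the configured value) would presumably apply to other options as well.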
## Interface

This is a strawman. Please shoot down.

Command line interface:
Command line interfaces allow easy integration with existing tools and editors.

USAGE: clang-format [ […]] [-- list of command line arguments to the parser]

: Specifies a code range to be reformatted; if no code range is given, assume the whole file.

Code level interface:
Reformatting source code is also a prerequisite for automated refactoring tools. We want to be able to integrate the reformatting as a post-processing step on top of other code transformations to make sure as little human intervention is needed as possible.
## Competition

TODO: List other formatting tools we’re aware of and how well they work

Hi,

we’re working on the design of clang-format; there are quite a few open questions, but I’d rather get some early feedback to see whether we’re completely off track somewhere.

Wow, having something like this would be great!

For those of you who prefer good old email, here is a copy of the current state.

+1 thanks :slight_smile:

## Context

While many other languages have auto-formatters available, C++ is still lacking a tool that fits the needs of the majority of C++ programmers. Note that when we talk about formatting as part of this document, we mean both the problem of indentation (which has been largely solved independently by regexp-based implementations in editors / IDEs) and line breaking, which proves to be a harder problem.

Also variable naming, use of #includes, etc? How much of http://llvm.org/docs/CodingStandards.html is realistically enforceable/detectable?

## Goals

  • Format a whole file according to a configuration
  • Format a part of a file according to a configuration
  • Format a part of a file while staying as consistent as possible with the rest of the file, falling back to a configuration for options that cannot be deduced from the current file
  • Integrate with editors so that you can just type away until you’re far past the column limit, then hit a key and have the editor lay out the code for you, including placing the right line breaks

Some wishlist items from me:

  • An “enforcer” mode that could be used in a post-commit script to find violations of the style.
  • A “scanner” mode that could be used to scan a corpus of existing code to find the dominant style, instead of having to manually configure a thousand arguments like indent.

## Non-goals

  • Indenting code while you type; this is a much simpler problem, but has even stronger performance requirements. The current editors should be good enough, and we’ll allow new workflows that don’t ever require the user to break lines

Makes sense, this is a different problem.

  • The only lexical elements clang-format should touch are whitespace, string literals and comments. Any other changes, ranging from ordering includes to removing superfluous parentheses, are not in the scope of this tool.
  • Per-file configuration: be able to annotate a file with a style which it adheres to (?)

If successful, the tool will probably be feature crept to support these. I think it is completely sensible to subset these out from any initial implementation though: better to solve some small problems well (and then grow in scope) than to try to solve all problems and never get to a point where it is useful.

## Code location

Clang-format is a very basic tool, so it might warrant living in clang mainstream. On the other hand it would also fit nicely with other clang refactoring tools. TODO: Where do we want clang-format to live?

No strong feeling.

## Parsing approach

The key consideration is whether clang-format can be based purely on a lexer, or whether it needs type information, in which case we need the full AST.

We believe that we will need the full AST information to correctly indent code, break lines, and fix whitespace within a line.

The major tradeoff here is that requiring an AST “requires” valid code and information on how to simulate the build. If you can use just the lexer, then you can run on a random header file in isolation.

Perhaps it is possible to subset and layer things so that some stuff works with just the lexer (e.g. 80 column detection) but other stuff requires more integration with AST and build info?

## Configuration

To support a majority of developers, being able to configure the desired style is key. We propose using a YAML configuration file, as there’s already a YAML parser readily available in LLVM. Proposals for more specific ideas welcome.

Makes sense to me.

## Style deduction

When changing the format of code that does not conform to a given style configuration, we will optionally try to deduce style options from the file first, and fall back to the configured layout when there was no clear style deducible from the context.

+100 :slight_smile:

-Chris

Hi,

we’re working on the design of clang-format; there are quite a few open questions, but I’d rather get some early feedback to see whether we’re completely off track somewhere.

## Non-goals

  • Indenting code while you type; this is a much simpler problem, but has even stronger performance requirements. The current editors should be good enough, and we’ll allow new workflows that don’t ever require the user to break lines
  • The only lexical elements clang-format should touch are whitespace, string literals and comments. Any other changes, ranging from ordering includes to removing superfluous parentheses, are not in the scope of this tool.

Oh…

I have 2 remarks here.

  1. The position of const and volatile qualifiers.

C++ allows having them either before or after the type they qualify (at the lower level). LLVM recommends putting them before (looks more English I guess) while I have seen other guides (and I prefer) systematically putting them after (for consistency, and I am French anyway!).

  2. The addition/removal of brackets for inline blocks

In C++, an if, else, for, while (not sure about do while) can be followed either by a block (with {}) or a single statement. Once again, purely a matter of style. LLVM recommends not putting them for example.

It seems to me that both would fit perfectly into a style formatter.
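
For readers less familiar with the two remarks, here is a small sketch of what is meant; the function names `absWithBraces` and `absWithoutBraces` are made up for the example. In both cases the two spellings mean exactly the same thing to the compiler, but switching between them changes tokens rather than just whitespace.

```cpp
#include <type_traits>

// Remark 1: the qualifier may come before or after the type it qualifies;
// both spellings name the same type.
static_assert(std::is_same<const int*, int const*>::value,
              "qualifier position does not change the type");

// Remark 2: a single statement after `if` may or may not be wrapped in {}.
inline int absWithBraces(int x) {
  if (x < 0) {
    return -x;
  }
  return x;
}

inline int absWithoutBraces(int x) {
  if (x < 0)
    return -x;
  return x;
}
```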

- Per-file configuration: be able to annotate a file with a style which it adheres to (?)

Perhaps a per-folder configuration file (and naturally inheriting from the parent folder if none available). And the ability to specialize the style for a few files within that configuration file, though it seems a bit overkill to go down to that level of detail.

Hi,

we’re working on the design of clang-format; there are quite a few open questions, but I’d rather get some early feedback to see whether we’re completely off track somewhere.

## Context

While many other languages have auto-formatters available, C++ is still lacking a tool that fits the needs of the majority of C++ programmers. Note that when we talk about formatting as part of this document, we mean both the problem of indentation (which has been largely solved independently by regexp-based implementations in editors / IDEs) and line breaking, which proves to be a harder problem.

Also variable naming, use of #includes, etc? How much of http://llvm.org/docs/CodingStandards.html is realistically enforceable/detectable?

I think we want 2 tools:

  1. The one proposed here, which deals with formatting; ordering includes is a corner case, doing static analysis to figure out iwyu or reverse-iwyu style rules is a clear non-goal
  2. What I call a “lint-style” tool, which will start right where clang-format leaves off and go deep into static analysis; the input format here I’d imagine to be more like “patterns” or “rules”, whereas for clang-format I’m basically imagining a lot of bool values and some numbers for the configuration.

## Goals

  • Format a whole file according to a configuration
  • Format a part of a file according to a configuration
  • Format a part of a file while staying as consistent as possible with the rest of the file, falling back to a configuration for options that cannot be deduced from the current file
  • Integrate with editors so that you can just type away until you’re far past the column limit, then hit a key and have the editor lay out the code for you, including placing the right line breaks

Some wishlist items from me:

  • An “enforcer” mode that could be used in a post-commit script to find violations of the style.

That should fall out naturally.
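
A rough sketch of why it falls out naturally: once a formatter can produce the formatted text, a check only has to compare that text against the original. The function name `firstViolationLine` is an assumption for illustration, and the formatted text is simply passed in, since no formatter API exists yet.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Returns the 1-based number of the first line where the original text and
// its formatted version differ, or 0 if every line matches. A trailing
// newline at the very end of the input is not treated as a difference.
std::size_t firstViolationLine(const std::string& original,
                               const std::string& formatted) {
  std::istringstream a(original), b(formatted);
  std::string lineA, lineB;
  std::size_t lineNo = 0;
  while (true) {
    bool hasA = static_cast<bool>(std::getline(a, lineA));
    bool hasB = static_cast<bool>(std::getline(b, lineB));
    if (!hasA && !hasB) return 0;  // both exhausted without a difference
    ++lineNo;
    if (hasA != hasB || lineA != lineB) return lineNo;
  }
}
```

A post-commit script could then flag any file for which the result is non-zero.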

  • A “scanner” mode that could be used to scan a corpus of existing code to find the dominant style, instead of having to manually configure a thousand arguments like indent.

Definitely an interesting idea, and something to keep in mind - I don’t know whether that would be one of the highest prio goals, but we should make it possible architecture-wise.

## Non-goals

  • Indenting code while you type; this is a much simpler problem, but has even stronger performance requirements. The current editors should be good enough, and we’ll allow new workflows that don’t ever require the user to break lines

Makes sense, this is a different problem.

  • The only lexical elements clang-format should touch are whitespace, string literals and comments. Any other changes, ranging from ordering includes to removing superfluous parentheses, are not in the scope of this tool.
  • Per-file configuration: be able to annotate a file with a style which it adheres to (?)

If successful, the tool will probably be feature crept to support these. I think it is completely sensible to subset these out from any initial implementation though: better to solve some small problems well (and then grow in scope) than to try to solve all problems and never get to a point where it is useful.

I also think we can give sensible alternatives to some…

## Parsing approach

The key consideration is whether clang-format can be based purely on a lexer, or whether it needs type information, in which case we need the full AST.

We believe that we will need the full AST information to correctly indent code, break lines, and fix whitespace within a line.

The major tradeoff here is that requiring an AST “requires” valid code and information on how to simulate the build. If you can use just the lexer, then you can run on a random header file in isolation.

Perhaps it is possible to subset and layer things so that some stuff works with just the lexer (e.g. 80 column detection) but other stuff requires more integration with AST and build info?

The very first draft I had that I didn’t send to the list actually had this one question at its core: how much can we do with the lexer only?
The problem is that I think we’ll not be able to do sufficiently better than standard-regexp-based solutions that it’s worth the effort. Even indenting needs types (as Richard pointed out) when templates are involved.

Thanks for your input!
/Manuel

Hi,

we’re working on the design of clang-format; there are quite a few open questions, but I’d rather get some early feedback to see whether we’re completely off track somewhere.

## Non-goals

  • Indenting code while you type; this is a much simpler problem, but has even stronger performance requirements. The current editors should be good enough, and we’ll allow new workflows that don’t ever require the user to break lines
  • The only lexical elements clang-format should touch are whitespace, string literals and comments. Any other changes, ranging from ordering includes to removing superfluous parentheses, are not in the scope of this tool.

Oh…

I have 2 remarks here.

  1. The position of const and volatile qualifiers.

C++ allows having them either before or after the type they qualify (at the lower level). LLVM recommends putting them before (looks more English I guess) while I have seen other guides (and I prefer) systematically putting them after (for consistency, and I am French anyway!).

This one sounds like it’s in scope, although I don’t know how far up on the prio list it will be … (meaning: probably not very high :wink:)

  2. The addition/removal of brackets for inline blocks

In C++, an if, else, for, while (not sure about do while) can be followed either by a block (with {}) or a single statement. Once again, purely a matter of style. LLVM recommends not putting them for example.

I think this is out of scope, but it’s definitely borderline. But I’m not sure. Let’s solve the core questions first and figure out details inside the issue tracker later :wink:

It seems to me that both would fit perfectly into a style formatter.

- Per-file configuration: be able to annotate a file with a style which it adheres to (?)

Perhaps a per-folder configuration file (and naturally inheriting from the parent folder if none available). And the ability to specialize the style for a few files within that configuration file, though it seems a bit overkill to go down to that level of detail.

Yea, a per-folder configuration definitely sounds like a good idea, as we’ll probably want to traverse the directory tree to find the configuration anyway.
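
The directory-tree traversal could look roughly like the sketch below. Everything in it is an assumption for illustration: the function name `findStyleFile`, the file name `.format-style`, and the restriction to absolute POSIX-style paths. The file-existence check is passed in as a callback so the lookup logic can be tried without touching the file system.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Walks from `directory` up towards the root and returns the path of the
// first configuration file for which `exists` returns true, or an empty
// string if no ancestor directory contains one.
std::string findStyleFile(
    std::string directory,
    const std::function<bool(const std::string&)>& exists,
    const std::string& fileName = ".format-style") {
  while (true) {
    std::string candidate =
        directory == "/" ? "/" + fileName : directory + "/" + fileName;
    if (exists(candidate)) return candidate;
    if (directory == "/" || directory.empty()) return "";
    std::size_t slash = directory.rfind('/');
    if (slash == std::string::npos) return "";
    // Step up to the parent directory, keeping "/" for the root.
    directory = slash == 0 ? "/" : directory.substr(0, slash);
  }
}
```

With this shape, a file in a subfolder inherits the parent folder's configuration simply because the walk continues upward until something is found.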

Thanks for your input!
/Manuel

I'm glad to see work in this area!

A few months ago, I set out to accomplish something similar. I didn't make much progress because I wanted to utilize the Python bindings instead of writing C++ (mainly because I think tools like this are much easier to implement in higher-level languages). Anyway, I was sidetracked because the Python bindings were lacking features needed to implement such a tool. (This actually spawned much of my work on the Python bindings, which have been trickling into the tree.)

Anyway, you may be interested in clanalyze [1]. My goal for clanalyze was to build APIs for inspecting the lexer output and AST. It is essentially a higher-level API for the Python bindings. I didn't get very far, so the code isn't that useful. What I do think is worth considering is the concept of an API that sits above the lexer output and AST and provides commonly needed functionality. I think any tool that does code formatting, style critiquing, documentation generation, etc., is bound to perform many of the same tasks, e.g. "get a list of all the functions in a class," "list all the classes declared in this file," "return the comment block(s) before this declaration."

Since something like clang-format would need such abilities and since there are many other tools that could reuse this logic, I think there is value in implementing a higher-level "query" interface as a standalone library/module. Such an interface could also provide a bridge between the lexer tokens and AST (you need to use both for things like getting the comment block(s) before a declaration).

I didn't get very far with clanalyze (yet). Hopefully you can do better.

Also worth mentioning is linty [2].

Greg

[1] https://github.com/indygreg/clanalyze
[2] https://github.com/holtgrewe/linty

Hi,

we’re working on the design of clang-format; there are quite a few open
questions, but I’d rather get some early feedback to see whether we’re
completely off track somewhere.

I’m glad to see work in this area!

A few months ago, I set out to accomplish something similar. I didn’t
make much progress because I wanted to utilize the Python bindings
instead of writing C++ (mainly because I think tools like this are much
easier to implement in higher-level languages). Anyway, I was

I think you’re targeting something different here (worth pursuing, for sure, but different).

clang-format is a tool to reformat code. The main use case we have is to reformat code after a refactoring. Our main goal is to write refactoring tools & services. That’s also why it must be implemented in C++ - we want to be able to combine refactorings and the clean-up in one process and output only changes.

Now the second level above this is what I call a “linter-style” tool. This might well be better implemented in Python on top of a query API (I doubt it, but I might be wrong). That seems more what you’re trying to build. I want to see such a tool, too, but we won’t get to that as one of our priorities any time soon (I know an engineer who would be interested in contributing, but he won’t have enough time to do the design work).

Before going too far down a path here, I’d suggest writing up a short design idea and sending it past this list, so you can get feedback from people who understand the implications of the different approaches very well and can catch dead ends before we hit them…

Cheers,
/Manuel

Hi Manuel,

Hi,

we’re working on the design of clang-format; there are quite a few open questions, but I’d rather get some early feedback to see whether we’re completely off track somewhere.

Great!

## Code location

Clang-format is a very basic tool, so it might warrant living in clang mainstream. On the other hand it would also fit nicely with other clang refactoring tools. TODO: Where do we want clang-format to live?

I think clang-format should live with the refactoring tools, wherever they go. However, refactoring is going to be a crucial technology for Clang going forward, which almost certainly means that it should migrate into mainline Clang at some point.

## Parsing approach

The key consideration is whether clang-format can be based purely on a lexer, or whether it needs type information, in which case we need the full AST.

We believe that we will need the full AST information to correctly indent code, break lines, and fix whitespace within a line.

I wonder how well we can do simply with the lexer and preprocessor. Introducing the AST traversal adds a lot of complication, but you’re right that it’s necessary to do a great job.

## Configuration

To support a majority of developers, being able to configure the desired style is key. We propose using a YAML configuration file, as there’s already a YAML parser readily available in LLVM. Proposals for more specific ideas welcome.

Seems reasonable.

## Style deduction

When changing the format of code that does not conform to a given style configuration, we will optionally try to deduce style options from the file first, and fall back to the configured layout when there was no clear style deducible from the context.
TODO: Detailed design ideas.

Yes, please!

## Interface

This is a strawman. Please shoot down.

Command line interface:
Command line interfaces allow easy integration with existing tools and editors.

USAGE: clang-format [ […]] [-- list of command line arguments to the parser]

: Specifies a code range to be reformatted; if no code range is given, assume the whole file.

For an editor to use this functionality efficiently, we’ll want it to go into a shared library (e.g., libclang).

I’m thrilled that you’re looking into this!

  - Doug

I added an entry under “goals” for this.

I think that if we want this to go into libclang, it might make sense to develop it in mainline as a library (or as part of the refactoring library) from the start.

Thoughts?
/Manuel

Yes, I agree.

  - Doug

There's an interesting paper here
https://mailserver.di.unipi.it/ricerca/proceedings/ETAPS05/ldta/ldta2005_p02.pdf
(nothing to do with clang) which refers to something called the LL-AST where
LL means 'literal layout'. Is that what is meant by 'full AST'?

This might be a bit tricky since the AST doesn't have the whitespace,
comment, and macro info in it :frowning: Maybe have to match it up somehow with
-dump-raw-tokens?

Lastly, function definitions don't seem to get handled very well by
-ast-dump; they just seem to get pretty printed :frowning: How to work around that
limitation?