RFC: clang-doc proposal

This proposal is to build a new Clang tool for generating C/C++ documentation, similar to other modern documentation generators such as rustdoc. This tool would be a modular and extensible documentation generator for C/C++ code. It also would introduce a more elegant way of documenting C/C++ code, as well as maintaining backwards-compatibility with Doxygen-style markup to generate documentation.

Today, Doxygen is the de facto standard for generating C/C++ documentation. While widely used, the tool itself is a bit cumbersome, its output is both aesthetically and functionally lacking, and its non-permissive license combined with an outdated codebase makes any improvement difficult. This new tool would aim to reduce the overhead of generating documentation by integrating it into a Clang tool, while allowing existing comments to continue to be used. It would also allow for relatively easy adaptation to new language features, as it would be built on the Clang parser and would use the Clang AST to generate documentation.

Proposed Tool

The proposed tool would consist of two parts. First, it would have a frontend that consumes the Clang-generated AST and generates an intermediate representation of the code and documentation structure, including additional Markdown files. Second, it would have a set of backend modules that consume that representation and output documentation in a particular format (e.g. Markdown, HTML/website, etc.).
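As a rough sketch of this split, the intermediate representation could be a set of plain structs and each backend a class behind a common interface. Every name below (ParamInfo, FunctionInfo, Generator, SignatureGenerator) is hypothetical and only illustrates the shape; none of it is an existing Clang API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical intermediate representation: one record per documented function.
struct ParamInfo {
  std::string Type;
  std::string Name;
};

struct FunctionInfo {
  std::string Name;
  std::string ReturnType;
  std::vector<ParamInfo> Params;
  std::string Brief; // Text of the attached documentation comment.
};

// Hypothetical backend interface: each output format implements generate().
class Generator {
public:
  virtual ~Generator() = default;
  virtual std::string generate(const FunctionInfo &Info) const = 0;
};

// Trivial backend that emits a one-line plain-text signature.
class SignatureGenerator : public Generator {
public:
  std::string generate(const FunctionInfo &Info) const override {
    assert(!Info.Name.empty() && "a function record needs a name");
    std::string Out = Info.ReturnType + " " + Info.Name + "(";
    for (std::size_t I = 0; I < Info.Params.size(); ++I) {
      if (I != 0)
        Out += ", ";
      Out += Info.Params[I].Type + " " + Info.Params[I].Name;
    }
    Out += ")";
    return Out;
  }
};
```

A Markdown or HTML backend would then be another Generator subclass consuming the same records, so the frontend would not need to know which formats exist.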

The frontend would be a new tool that uses the Clang parser, which can already parse C/C++ documentation comments (using the -Wdocumentation option). It can be easily used through the LibTooling interface, similarly to other Clang tools such as clang-check or clang-format. The initial steps in this project would be to build this tool using Clang’s documentation parser. This tool would be able to attach comments to functions, types, and macros, and to resolve declaration references; both capabilities will be useful in generating effective documentation. Since a good deal of existing C/C++ code uses the Doxygen documentation comment style, which is also supported by Clang’s parser (and Doxygen itself can use Clang to parse these comments), this is the syntax we are going to support as well. In the future, we would also like to support Markdown-style comments, akin to Apple Swift Markup.

For implementation, this tool will use the JSON Compilation Database format to integrate with existing build systems. It would also have subcommands to choose which parts of the code will be documented (e.g. all code, all public signatures, all comment-documented signatures). Once the code is processed, the tool will write out its internal representation of the documentation in an intermediate format, encapsulating the necessary information about the code, comments, and structure. This will allow backend tools to take the output and transform it as necessary.
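The three selection modes mentioned above (all code, public signatures, comment-documented signatures) could be modeled as a simple filter over the collected declarations. This is a minimal sketch under assumed names: DocScope, DeclSummary, shouldDocument, and selectDecls are hypothetical, and a real tool would derive the flags from the Clang AST rather than store them by hand.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical selection modes, mirroring the subcommands described above.
enum class DocScope { All, Public, Documented };

// Hypothetical summary of a declaration as the frontend might record it.
struct DeclSummary {
  std::string Name;
  bool IsPublic;
  bool HasComment;
};

// Decide whether a declaration belongs in the output for the given mode.
bool shouldDocument(const DeclSummary &D, DocScope Scope) {
  switch (Scope) {
  case DocScope::All:
    return true;
  case DocScope::Public:
    return D.IsPublic;
  case DocScope::Documented:
    return D.HasComment;
  }
  assert(false && "unknown DocScope value");
  return false;
}

// Collect the names of all declarations selected by the given mode.
std::vector<std::string> selectDecls(const std::vector<DeclSummary> &Decls,
                                     DocScope Scope) {
  std::vector<std::string> Selected;
  for (const DeclSummary &D : Decls)
    if (shouldDocument(D, Scope))
      Selected.push_back(D.Name);
  return Selected;
}
```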

The backend modules would cover different possible outputs for the defined intermediate representation. Each module will consume the representation and output documentation in a specific format. Initially, we propose to focus on a module that generates Markdown files, in order to make the first version as simple as possible. Markdown files are automatically rendered on a number of sites and systems, as well as being clear and uncluttered in raw text form. It is also relatively easy to convert Markdown files into other formats, making it a good starting target. An additional module would target HTML/website output.
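To give a sense of how small a first Markdown module could be, here is a sketch that renders a single function record as a Markdown section. The FunctionDoc struct, the renderMarkdown function, and the chosen layout (heading, signature, brief, parameter list) are all assumptions for illustration, not a proposed output format.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record handed to the Markdown backend.
struct FunctionDoc {
  std::string Name;
  std::string Signature;
  std::string Brief;
  // Pairs of (parameter name, description).
  std::vector<std::pair<std::string, std::string>> Params;
};

// Render one function as a Markdown section.
std::string renderMarkdown(const FunctionDoc &Doc) {
  assert(!Doc.Name.empty() && "a function record needs a name");
  std::string Out = "### " + Doc.Name + "\n\n";
  Out += "`" + Doc.Signature + "`\n\n";
  if (!Doc.Brief.empty())
    Out += Doc.Brief + "\n\n";
  if (!Doc.Params.empty()) {
    Out += "**Parameters**\n\n";
    for (const auto &P : Doc.Params)
      Out += "- `" + P.first + "`: " + P.second + "\n";
  }
  return Out;
}
```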

Intermediate Format

The frontend would process the code and comments into an output, to be consumed by the backend. This representation would be internally represented as a set of classes and structs. Once the frontend has finished, it would write this representation to a file. While existing tools like Doxygen emit XML, XML is somewhat restrictive and bulky. Also, in order to fully use XML, the tool would need to define the representation twice (once for the internal classes/structs, once in the XML schema). So, we are instead considering two possible formats for this intermediate step: LLVM bitstream and JSON/YAML.

LLVM bitstream format is space-efficient, and is natively written out by the Clang parser. It has the benefit of being similar to existing clang functionality, as the compiler frontend writes out its AST into the bitstream format to pass along to the LLVM backend. Using this format would allow the tool to emit the representation with minimal manipulation or additional parsing.

Alternatively, JSON/YAML, while less space-efficient than bitstream, are human-readable and widely extensible. Neither has formal grammar or namespacing support, so if the tool needed rules of the sort it would need to define them itself on the frontend and require that the backend modules know them. While this would require a bit more parsing to emit on the frontend and load on the backend, the representation would be able to stand separately from the tool, and the backend modules would not necessarily need an understanding of the LLVM bitstream to load it.
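For comparison, emitting a human-readable form directly from the internal structs could look roughly like the sketch below. The record layout and the field names (Name, ReturnType, Params, Type) are hypothetical, and the emitter is hand-written for illustration only; a real implementation would more likely rely on an existing YAML library. Values are wrapped in YAML single quotes, where a literal quote is escaped by doubling it; multi-line values are not handled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical internal records to be serialized.
struct ParamRecord {
  std::string Name;
  std::string Type;
};

struct FunctionRecord {
  std::string Name;
  std::string ReturnType;
  std::vector<ParamRecord> Params;
};

// Wrap a value in YAML single quotes; a literal ' is escaped by doubling it.
std::string quoteYaml(const std::string &Value) {
  std::string Out = "'";
  for (char C : Value) {
    if (C == '\'')
      Out += "''";
    else
      Out += C;
  }
  Out += "'";
  return Out;
}

// Serialize one function record as a small YAML mapping.
std::string toYaml(const FunctionRecord &F) {
  assert(!F.Name.empty() && "every record needs a name");
  std::string Out;
  Out += "Name: " + quoteYaml(F.Name) + "\n";
  Out += "ReturnType: " + quoteYaml(F.ReturnType) + "\n";
  if (F.Params.empty()) {
    Out += "Params: []\n";
    return Out;
  }
  Out += "Params:\n";
  for (const ParamRecord &P : F.Params) {
    Out += "  - Name: " + quoteYaml(P.Name) + "\n";
    Out += "    Type: " + quoteYaml(P.Type) + "\n";
  }
  return Out;
}
```

The field names are exactly the kind of rule that, as noted above, the frontend would have to define itself and every backend would have to know.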

Extensions

In addition to generating documentation from comments, a future extension would be to automatically generate and insert boilerplate comments into the code on demand. As the tool would have access to the AST, it could insert comments into the code similar to how tools like clang-tidy and clang-format adjust the code. Such generated comments would follow the documentation style for comments, and so would generate basic, if not wholly described, documentation, including information about parameters, return types, class members, etc. For example, the following would be generated for the below function:

/// Do Things
///
/// TODO: Write detailed description
///
/// \param value
/// \return int
int doThings(int value) { return value; }
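A minimal sketch of how such a stub could be produced from a function's signature follows. The Signature struct and the humanize and makeStubComment functions are hypothetical names, and deriving the title by splitting the camelCase identifier is an assumption about how "Do Things" would be obtained; a real implementation would read the signature from the AST.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of a function signature.
struct Signature {
  std::string Name;
  std::string ReturnType;
  std::vector<std::string> ParamNames;
};

// Turn a camelCase identifier into a title, e.g. "doThings" -> "Do Things".
std::string humanize(const std::string &Name) {
  std::string Out;
  for (std::size_t I = 0; I < Name.size(); ++I) {
    unsigned char C = static_cast<unsigned char>(Name[I]);
    if (I == 0) {
      Out += static_cast<char>(std::toupper(C));
    } else {
      if (std::isupper(C))
        Out += ' ';
      Out += Name[I];
    }
  }
  return Out;
}

// Build a Doxygen-style boilerplate comment for the given signature.
std::string makeStubComment(const Signature &Sig) {
  assert(!Sig.Name.empty() && "a signature needs a function name");
  std::string Out = "/// " + humanize(Sig.Name) + "\n";
  Out += "///\n";
  Out += "/// TODO: Write detailed description\n";
  Out += "///\n";
  for (const std::string &P : Sig.ParamNames)
    Out += "/// \\param " + P + "\n";
  // Functions returning void get no \return line.
  if (Sig.ReturnType != "void")
    Out += "/// \\return " + Sig.ReturnType + "\n";
  return Out;
}
```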

In addition, the parsing tool could be expanded to parse Markdown-style comments, using the Apple Swift Markup style as a reference.

Please let us know if you have comments or concerns about this proposal.

Thanks!

Julie

Needless to say, a good deal of existing C/C++ code does not use the
Doxygen documentation comment style, or any other explicit markup style.
Additionally, doxygen (and similar things like Javadoc) sacrifice some
readability of the comments for the documentation.

For these reasons, it'd be great to have first-class support for "plain"
comments. (Even though it's a hard problem)

Hi.
I'm pretty much just a user, but I have some thoughts.

This proposal is to build a new Clang tool for generating C/C++
documentation, similar to other modern documentation generators such as
rustdoc. This tool would be a modular and extensible documentation
generator for C/C++ code. It also would introduce a more elegant way of
documenting C/C++ code, as well as maintaining backwards-compatibility with
Doxygen-style markup to generate documentation.

Today, Doxygen is a de-facto standard for generating C/C++ documentation.
While widely used, the tool itself is a bit cumbersome, its output is both
aesthetically and functionally lacking, and the non-permissive license
combined with outdated codebase make any improvements difficult. This new
tool would aim to simplify the overhead of generating documentation,
integrating it into a Clang tool as well as allowing existing comments to
continue to be used. It would also allow for relatively easy adaptation to
new language features, as it would be built on the Clang parser and would
use the Clang AST to generate documentation.

Sounds awesome so far.

Proposed Tool

The proposed tool would consist of two parts. First, it would have a
frontend that consumes the Clang-generated AST and generates an intermediate
representation of the code and documentation structure, including additional
Markdown files. Second, it would have a set of backend modules that consume
that representation and output documentation in a particular format (e.g.
Markdown, HTML/website, etc.).

The frontend would be a new tool that uses the Clang parser, which can
already parse C/C++ documentation comments (using -Wdocumentation option).
It can be easily used through the LibTooling interface, similarly to other
Clang tools such as clang-check or clang-format. The initial steps in this
project would be to build this tool using Clang's documentation parser. This
tool would be able to attach comments to both functions, types, and macros
and resolve declaration references, both of which will be useful in
generating effective documentation. Since a good deal of existing C/C++ code
uses the Doxygen documentation comment style, which is also supported by
Clang's parser (and Doxygen itself can use Clang to parse these comments),
this is the syntax we are going to support as well. In the future, we would
also like to support Markdown-style comments, akin to Apple Swift Markup.

For implementation, this tool will use the JSON Compilation Database format
to integrate with existing build systems. It would also have subcommands to
choose which parts of the code will be documented (e.g. all code, all public
signatures, all comment-documented signatures). Once the code is processed,
the tool will write out the internal representation of the documentation
in an intermediate representation, encapsulating the necessary information
about the code, comments, and structure. This will allow backend tools to
take the output and transform it as necessary.

The backend modules would cover different possible outputs for the defined
intermediate representation. Each module will consume the representation and
output documentation in a specific format. Initially, we propose to focus on
a module that generates Markdown files, in order to make the first version
as simple as possible. Markdown files are automatically rendered on a number
of sites and systems, as well as being clear and uncluttered in raw text
form. It is also relatively easy to convert Markdown files into other
formats, making it a good starting target. An additional module would target
HTML/website output.

While I understand the reasoning, I'm not sure the backends are a great idea.
TL;DR: how about *only* outputting RST (well, or MD) and delegating the
rest to Sphinx? This *should* allow for native integration into Sphinx-based
documentation, which is currently not achievable natively with Doxygen.

Intermediate Format

The frontend would process the code and comments into an output, to be
consumed by the backend. This representation would be internally represented
as a set of classes and structs. Once the frontend has finished, it would
write this representation to a file. While existing tools like Doxygen emit
XML, XML is somewhat restrictive and bulky. Also, in order to fully use XML,
the tool would need to define the representation twice (once for the
internal classes/structs, once in the XML schema). So, we are instead
considering two possible formats for this intermediate step: LLVM bitstream
and JSON/YAML.

LLVM bitstream format is space-efficient, and is natively written out by the
Clang parser. It has the benefit of being similar to existing clang
functionality, as the compiler frontend writes out its AST into the
bitstream format to pass along to the LLVM backend. Using this format would
allow the tool to emit the representation with minimal manipulation or
additional parsing.

Alternatively, JSON/YAML, while less space-efficient than bitstream, are
human-readable and widely extensible. Neither has formal grammar or
namespacing support, so if the tool needed rules of the sort it would need
to define them itself on the frontend and require that the backend modules
know them. While this would require a bit more parsing to emit on the
frontend and load on the backend, the representation would be able to stand
separately from the tool, and the backend modules would not necessarily need
an understanding of the LLVM bitstream to load it.

I'm not seeing any mention of graph/diagram generation.

Roman

Oops, forgot to mention that this is a fantastic idea; Doxygen is long overdue for a replacement.
Good luck!

I’m currently working on standardese, which aims to do exactly that: github.com/foonathan/standardese

It is designed in a similar way, but the intermediate format is just in-memory.

standardese uses libclang (which was a mistake in hindsight), and has its own comment parser as it also supports markup and special commands.

This works as described, but the intermediate representation isn’t written out.

Multiple backends are supported (though the new version on the develop branch currently only has XML and HTML).

The backend modules would cover different possible outputs for the defined
intermediate representation. Each module will consume the representation and
output documentation in a specific format. Initially, we propose to focus on
a module that generates Markdown files, in order to make the first version
as simple as possible. Markdown files are automatically rendered on a number
of sites and systems, as well as being clear and uncluttered in raw text
form. It is also relatively easy to convert Markdown files into other
formats, making it a good starting target. An additional module would target
HTML/website output.

While I understand the reasoning, I’m not sure the backends are a great idea.
TL;DR: how about only outputting RST (well, or MD) and delegating the
rest to Sphinx? This should allow for native integration into Sphinx-based
documentation, which is currently not achievable natively with Doxygen.

We’re looking to have something a little more flexible for the intermediate format than MD, to allow for a number of different formats for output in the future. The type of integration you’re suggesting would be a great extension to the backend module that emits MD, which would be fairly easy to add into this.

Julie,

This proposal touches on some areas that I know quite well. As a co-founder of the DoxyPress project I have learned a great deal about C++ documentation and I believe I can provide some useful context and experience.

Many believe, inaccurately, that Doxygen uses clang for parsing. That is not what their "Clang assisted parsing" option does. In reality, Doxygen uses the clang frontend only when the Doxygen lex parser gets into an error state while parsing a template declaration. All C++ parsing and lexing is done in a hand-written lexer. On the other hand, the clang parsing option in DoxyPress actually does what you would expect, and uses the clang frontend. All documentation is then generated based on the AST.

I agree, Doxygen was the de facto standard, but it should no longer be considered for documenting modern C++ code since a hand-rolled lexer simply cannot keep up with current syntax. It is worth mentioning the clang documentation generated by DoxyPress is far better than the current Doxygen documentation. If you would like to see it I can post it somewhere for your review.

If you are looking to add a documentation tool to the clang suite, there are a lot more questions that will come up in the process. If all you are looking to do is document clang itself and clang tooling, that is a much more constrained use case.

As a side note, the code in clang that supports -Wdocumentation is very limited and only parses a very minimal set of Doxygen/DoxyPress comment syntax. When we looked at it last year, it appeared that a great deal of work would need to be done to support the majority of comments that will be found in an existing C++ codebase which uses Doxygen.

I would encourage more discussion and a bit more thought on narrowing down the problem you would like to solve. If there are additional questions about how we addressed these issues in DoxyPress we are very happy to be part of the discussion.

Hope you find some of these comments useful.

Thanks for all the feedback – we’re open to collaboration, so any input or thoughts are welcome! On that note, would anyone be willing to act as a reviewer for this?

Julie

Hi Julie,

I saw that the first bits already landed. Congratulations, great work! :)
I have a few questions.

Do you plan to include the backends in the clang (tools) repository?

I think it would be great to have at least one reference backend next to the frontend.

Building an AST can be quite expensive. Do you plan to support generating documentation as a by-product of building the code? Similar to how indexing while building was proposed by Apple.

Do you plan for the intermediate representation to be self-contained, or will the backends need access to the original files (available at the original paths)?

Regards,

Gábor

Do you plan to include the backends in the clang (tools) repository?

I think it would be great to have at least one reference backend next to the frontend.

Yes, there are currently YAML and Markdown generators in the works (see https://reviews.llvm.org/D43667 and https://reviews.llvm.org/D43424 for rough details; the Markdown one in particular is still in flux).

Building an AST can be quite expensive. Do you plan to support generating documentation as a by-product of building the code? Similar to how indexing while building was proposed by Apple.

A good idea – for right now, that’s not supported, but I’d definitely be interested in extending it to be a by-product.

Do you plan for the intermediate representation to be self-contained, or will the backends need access to the original files (available at the original paths)?

The representation is self-contained.

Please let me know if you have more questions!
Julie