[GSoC] Doxygen documentation with clang

Hello clang-devel,

one of the project ideas for GSoC 2014 is a clang-based tool to
generate documentation using doxygen-style comments in the source
code. I wanted to gauge the interest in such a project, see if
someone is willing to mentor it, and provide a rough outline of what
my idea of the project is. Any feedback on this is very welcome.

2 Prior Work
════════════

• clang already understands doxygen-style comments to a degree and
  attaches them to the AST:
  [http://llvm.org/devmtg/2012-11/Gribenko_CommentParsing.pdf]
• doxygen can already use clang as a backend
  [http://comments.gmane.org/gmane.comp.compilers.clang.devel/29490]
• there already is a cldoc [https://github.com/jessevdk/cldoc]

3 Project Plan
══════════════

3.1 Fully parse doxygen comments
────────────────────────────────

Doxygen supports markdown, HTML entities, if/endif, post-definition
documentation, file scope doc, function groups, member groups, pages,
page hierarchies, examples, links, auto-links, and todo/bug lists (the
dreaded xrefitem).

Some of those features might seem like overkill but they usually ended
up in doxygen because someone wanted them and they are actively used
in "the real world" (c).

The CommentParser should do its best to represent those in a useful
fashion in the CommentAST (especially link resolving) so tools further
down the chain can focus on their tasks only.
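
As a rough illustration of what a tool further down the chain would
like to receive, here is a minimal Python sketch that splits a
doxygen-style comment into a brief, parameters, and a return
description. It is not how clang's comment parser works: the
ParsedComment type and the handful of commands it recognises are
assumptions made for this example, and continuation lines are simply
collected as free text.

```python
import re
from dataclasses import dataclass, field

# Leading comment decoration: "/**", "/*!", "*/", "*", "///" or "//!".
DECORATION_RE = re.compile(r"^(?:/\*+!?|\*/|\*|//[/!])\s?")
# A line that starts with a command, e.g. "@param x text".
COMMAND_RE = re.compile(r"^[@\\](\w+)\s*(.*)$")


@dataclass
class ParsedComment:  # hypothetical structure, for illustration only
    brief: str = ""
    params: dict = field(default_factory=dict)
    returns: str = ""
    text: list = field(default_factory=list)     # lines without a command
    unknown: list = field(default_factory=list)  # unrecognised command names


def parse_comment(raw: str) -> ParsedComment:
    parsed = ParsedComment()
    for line in raw.strip().splitlines():
        # Drop the comment decoration and a trailing "*/", if any.
        line = DECORATION_RE.sub("", line.strip())
        line = re.sub(r"\s*\*/$", "", line).strip()
        if not line:
            continue
        match = COMMAND_RE.match(line)
        if match is None:
            parsed.text.append(line)
            continue
        name, rest = match.groups()
        if name == "brief":
            parsed.brief = rest
        elif name == "param":
            param_name, _, description = rest.partition(" ")
            parsed.params[param_name] = description.strip()
        elif name in ("return", "returns"):
            parsed.returns = rest
        else:
            parsed.unknown.append(name)
    return parsed


example = parse_comment("""/**
 * @brief Adds two numbers.
 * @param a first summand
 * @param b second summand
 * @return the sum
 */""")
print(example.brief, example.params, example.returns)
```

Everything listed in the paragraph on supported features (markdown,
groups, pages, xrefitem, link resolving) is deliberately left out
here; the point is only the shape of the data a generator would
consume.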

3.2 To represent intermediately or not
──────────────────────────────────────

The actual documentation generation tool has two options:
• use libclang, work on the AST directly and spit out documentation.
• let clang produce some intermediate representation (XML?) and work
  on this

The first option seems to be the easy road but would tie the
generation directly into clang. It also seems harder to extend and
reuse.

The second option is probably the most general approach. Generating XML
to represent the AST is actually proposed as its own GSoC project. Maybe
it would be possible to produce a reduced XML only containing
declarations and comments that could later be extended to feature the
full AST. Designing this schema is probably non-trivial and should be
well thought through.

The slides on -Wdocumentation already mention the ability to produce
XML but I couldn't figure out yet how to get that to work. From
glancing at the schema in bindings/xml/comment-xml-schema.rng it looks
pretty useful already, but some features (header dependencies,
inheritance relationships) are AFAIK missing.

The main benefit of an intermediate representation would be to enable
us to build something akin to doxygen's "external projects" feature,
which is incredibly useful (not having it would be a deal-breaker for
some of my own projects).

3.3 Actual Generators
─────────────────────

How should the actual generators be defined?

What we should strive for: doxygen configuration that is directly
portable, almost no configuration for the common case, and at least
simple HTML and LaTeX output for starters.

There are almost endless possibilities to do this and all of them have
different trade-offs.

3.3.1 The doxygen way
╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌

One driver (with many special cases) calls into generator
back-ends. All defined in C++. Almost no room for customization (css
in HTML, custom headers in LaTeX, nothing for the rest). Easy for the
implementer.

3.3.2 Templating engines
╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌

Provide templates, have a driver that populates them. Likely the most
general approach, but different template engines for different output
formats with different capabilities. Complexity is in the driver.

3.3.3 Database + Web-server
╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌

A special case for HTML. Provide a database and a web-frontend that
can be hosted. Seems interesting for fast search functions and live
documentation updates. clang-server where are you?

3.3.4 A shim for doxygen
╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌

Doxygen already can produce XML, but doesn't use it for anything
internally (and the XML isn't really that useful anyway). That
capability could be expanded, but doing so would require hacking on
doxygen itself.

3.4 Summary
───────────

It's a large project, but each stage provides functionality that could
be a contribution on its own.

4 Why not improve doxygen instead?
══════════════════════════════════

Doxygen is incredibly hard to hack on, burdened by backward
compatibility (going so far that it prevents obvious bugs from being
fixed), and supports a strange set of languages which are not really
C/C++ like, which makes a lot of changes impossible or very hard. The
support for templates is abysmal and hard to fix without eventually
introducing a full C++ AST. I'm not trying to bash doxygen here. I
used it to build cool things and Dimitri is doing a better job at
maintenance than a lot of other OS devs.

5 Who am I anyway?
══════════════════

My name is Philipp Möller and I'm an MSc Computer Science student at
the University of Saarland, Germany. I have already participated in
GSoC 2011 with the CGAL (www.cgal.org) project. Why do I care about
documentation so much? I built a large part of the CGAL doxygen
documentation ([http://doc.cgal.org/latest/]), got to know almost all
the quirks and awesomeness of doxygen, and produced a few doxygen
patches in the process.

I'm used to making small contributions to open-source projects and
used to mailing-list communication.

6 TLDR
══════

Build a tool to generate documentation from doxygen-style comments
with clang that supports a lot of doxygen's features. What do you think
of this project idea? Would there be a mentor for this project?

Cheers,
Philipp

Hello,

This project sounds very interesting and definitely useful; "clang-doc" in addition
to clang-format and the static analyzer.

A few points:

1. Doxygen offers the possibility to define custom commands in its config file[1]. I am
   using this in my code to document the bindings to the scripting language Lua, i.e.
   @luaparam to document a parameter and @luareturn to document a return value. AFAIK
   clang currently does not support custom commands and warns about their use (unless
   -Wno-documentation-unknown-command is specified).

   Do you plan on extending clang so custom doxygen commands can be defined?

2. As far as the external projects you mention go: Maybe they could be implemented with
   precompiled header files, similar to the way clang's modules (in C and Objective-C) are
   handled[2]: By saving and loading clang's AST.

Jonathan

[1] <http://www.stack.nl/~dimitri/doxygen/manual/custcmd.html>
[2] <http://clang.llvm.org/docs/Modules.html>

Note that Clang supports the -fcomment-block-commands= parameter, which
allows us to get the correct AST for custom commands. I think this
should also silence the warning.
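
To make the idea of registering custom commands concrete, here is a
small Python sketch of a checker that reports unknown commands unless
they have been registered. It only mimics the behaviour discussed
above and is not clang's implementation; the BUILTIN_COMMANDS set is a
deliberately tiny, assumed subset, and the function name is made up
for this example.

```python
import re

# A small, incomplete set of built-in commands, for illustration only.
BUILTIN_COMMANDS = {"brief", "param", "return", "returns", "see", "note"}

# Any "@name" or "\name" occurrence counts as a command in this toy version.
COMMAND_RE = re.compile(r"[@\\]([A-Za-z]+)")


def unknown_commands(comment: str, custom_commands=()) -> list:
    """Return command names that are neither built in nor registered."""
    known = BUILTIN_COMMANDS | set(custom_commands)
    unknown = []
    for name in COMMAND_RE.findall(comment):
        if name not in known and name not in unknown:
            unknown.append(name)
    return unknown


comment = """
@brief Moves the sprite.
@luaparam dx horizontal offset
@luareturn the new position
"""

# Without registration both Lua commands are reported.
print(unknown_commands(comment))
# Registering them, as a comma-separated option would, silences the report.
print(unknown_commands(comment, "luaparam,luareturn".split(",")))
```

With clang itself the corresponding invocation would look something
like "clang -fsyntax-only -Wdocumentation
-fcomment-block-commands=luaparam,luareturn file.cpp", where file.cpp
is a placeholder.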

Dmitri

"Jonathan 'Rynn' Sauer" <jonathan.sauer@gmx.de>
writes:

1. Doxygen offers the possibility to define custom commands in its
   config file[1]. I am using this in my code to document the bindings to
   the scripting language Lua, i.e. @luaparam to document a parameter
   and @luareturn to document a return value. AFAIK clang currently does
   not support custom commands and warns about their use (unless
   -Wno-documentation-unknown-command is specified).

   Do you plan on extending clang so custom doxygen commands can be
   defined?

Yes, that would certainly be necessary. I would go so far as to say
that supporting custom commands is a must for "clang-doc" (tentative
name for now).

If custom commands are supported they would be defined in the
documentation generation file similar to Doxygen and the comment parser
would also handle their expansion. Additionally they could be passed as
command line parameters so -Wdocumentation-unknown-command can be used
during normal compilation.

See Dmitri's reply for a possible trick to get this working now.

2. As far as the external projects you mention go: Maybe they could be
   implemented with precompiled header files, similar to the way clang's
   modules (in C and Objective-C) are handled[2]: By saving and loading
   clang's AST.

Interesting approach, but Module support is experimental for C++ and
likely to change should they ever be standardized. I'm a little out of
the loop concerning the C++ modules proposal and unable to find
up-to-date information. (AFAIK, SG2 hasn't done any work on the proposal
during the last meetings.) Maybe someone more familiar with modules
could provide some insight if this is feasible.

Hi Philipp,

Please excuse me for the late reply.

I am very interested in this project and I would be happy to mentor it.

Comments inline.

3.1 Fully parse doxygen comments
────────────────────────────────

The CommentParser should do its best to represent those in a useful
fashion in the CommentAST (especially link resolving) so tools further
down the chain can focus on their tasks only.

Link resolving is pretty important when generating self-contained
documentation files. Clang does not attempt to resolve links right
now. I expect that implementing this will need a significant time
investment.

Another important missing feature is attaching comments to macros.
Currently Clang can only attach comments to declarations.

3.2 To represent intermediately or not
──────────────────────────────────────

The slides on -Wdocumentation already mention the ability to produce
XML but I couldn't figure out yet how to get that to work. From
glancing at the schema in bindings/xml/comment-xml-schema.rng it looks
pretty useful already, but some features (header dependencies,
inheritance relationships) are AFAIK missing.

Clang's XML comment representation should accurately represent the
comment in a way that:

- is extensible and future-proof,
- allows us to change the AST while maintaining backward compatibility.

Information like inheritance relationships is omitted on purpose,
because it is expected that the client is using libclang and can query
the additional information as needed.

3.3.3 Database + Web-server
╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌

A special case for HTML. Provide a database and a web-frontend that
can be hosted. Seems interesting for fast search functions and live
documentation updates. clang-server where are you?

This looks like a very promising approach that does not just provide
the same functionality as Doxygen does, but introduces new value.
This can actually become the foundation for the clang-server itself!
The basic functionality for live updates -- tracking dependencies
between source files, indexing and reindexing will be useful for both
documentation server and clang-server.

Dmitri

Dmitri Gribenko <gribozavr@gmail.com> writes:

Hi Philipp,

Please excuse me for the late reply.

I am very interested in this project and I would be happy to mentor
it.

Great to hear that. I have done a GSoC already and there are a few
things I thought worked really well with my last mentor. We probably
should go over them separately and see if we have the same idea of how
all this should work.

I'm also not familiar with llvm project politics and there is the
question how many slots llvm will get and how much promotion and
"lobbying" is necessary to make this thing happen. I was thinking about
cross-posting my mail to clang-dev to llvm-dev as well and see how much
of a response there is.

For a good application I would like to define a certain set of
milestones we want to achieve. If you have anything specific in mind,
please let me know.

3.1 Fully parse doxygen comments
────────────────────────────────

Link resolving is pretty important when generating self-contained
documentation files. Clang does not attempt to resolve links right
now. I expect that implementing this will need a significant time
investment.

Yes, the way doxygen approaches linking is far from trivial
(auto-linking, namespace guessing, sometimes it considers scope) and
I would allocate a decent amount of time for this.

There also would need to be some way to defer linking when external
projects are involved (possibly marking certain chunks of a comment as
linkable, but unresolved).
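
One way to picture "linkable, but unresolved" chunks is a comment
representation in which references are separate nodes whose target may
stay empty until a later pass. The Python sketch below is only an
illustration under that assumption: the Text and Reference types, the
restriction to the explicit "@ref name" form, and the opaque target
strings are all hypothetical, and doxygen's auto-linking is not
modelled at all.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Text:
    value: str


@dataclass
class Reference:
    spelling: str                 # name as written in the comment
    target: Optional[str] = None  # opaque identifier once resolved


# "@ref name" or "\ref name"; qualified names such as ns::f are allowed.
REF_RE = re.compile(r"[@\\]ref\s+([A-Za-z_][\w:]*)")


def split_references(text: str) -> List[Union[Text, Reference]]:
    """Split comment text into plain text and unresolved reference nodes."""
    nodes = []
    pos = 0
    for match in REF_RE.finditer(text):
        if match.start() > pos:
            nodes.append(Text(text[pos:match.start()]))
        nodes.append(Reference(match.group(1)))
        pos = match.end()
    if pos < len(text):
        nodes.append(Text(text[pos:]))
    return nodes


def resolve(nodes, known_targets: dict) -> list:
    """Fill in known targets; return the spellings that stay unresolved."""
    unresolved = []
    for node in nodes:
        if isinstance(node, Reference) and node.target is None:
            node.target = known_targets.get(node.spelling)
            if node.target is None:
                unresolved.append(node.spelling)
    return unresolved


nodes = split_references("See @ref foo::bar and @ref ext::baz for details.")
# First pass: only targets from the current project are known.
print(resolve(nodes, {"foo::bar": "id-of-foo-bar"}))
# Later pass: an external project supplies the remaining target.
print(resolve(nodes, {"ext::baz": "id-of-ext-baz"}))
```

Because unresolved nodes keep their spelling, the second pass can run
long after the first, which is the property deferred linking against
external projects would need.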

Another important missing feature is attaching comments to macros.
Currently Clang can only attach comments to declarations.

Is this due to a limitation of the AST or just a feature that you
skipped in your first batch of -Wdocumentation? I always assumed it
shouldn't be too hard, but then that's probably just me being naive.

3.2 To represent intermediately or not
──────────────────────────────────────

Clang's XML comment representation should accurately represent the
comment in a way that:

- is extensible and future-proof,
- allows us to change the AST while maintaining backward compatibility.

OK, I didn't feel qualified to make those statements. Knowing you think
the XML is suitable makes it a very viable candidate.

Information like inheritance relationships is omitted on purpose,
because it is expected that the client is using libclang and can query
the additional information as needed.

This would make it necessary that the XML carries the information to
build all the translation units that were used to generate the XML in
the first place, correct? Or is libclang able to deserialize the XML to
reconstruct the AST?

If the XML would already provide this information we would get looser
coupling, but it wouldn't be as flexible and every time new, helpful
information is discovered the XML would need to evolve.

I prefer your approach.

3.3.3 Database + Web-server
╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌

This looks like a very promising approach that does not just provide
the same functionality as Doxygen does, but introduces new value.
This can actually become the foundation for the clang-server itself!
The basic functionality for live updates -- tracking dependencies
between source files, indexing and reindexing will be useful for both
documentation server and clang-server.

The main question here seems to be how to represent the persistent AST:

- relational DB
- NoSQL
- graph DB

all seem like they could work and I don't have a clear idea how any
of them is going to perform.

Updating the mapping can probably be done with different granularity
giving better performance on lower granularity but being harder to
implement.

IIRC there used to be design documents for clang-server somewhere on the
web. I'll look for them to get a clearer picture of the
requirements.

I'm absolutely not averse to working on this, but maybe we should focus
on first improving compatibility with Doxygen and comment parsing and
move into this topic later. There seems to be plenty of work in the
first stages already.

Cheers,
Philipp

Dmitri Gribenko <gribozavr@gmail.com> writes:

I'm also not familiar with llvm project politics and there is the
question how many slots llvm will get and how much promotion and
"lobbying" is necessary to make this thing happen. I was thinking about
cross-posting my mail to clang-dev to llvm-dev as well and see how much
of a response there is.

I don't think there is much politics involved.

For a good application I would like to define a certain set of
milestones we want to achieve. If you have anything specific in mind,
please let me know.

Based on the discussion so far, I think this could be used as a draft plan:

- attaching comments to macros;
- parsing the reference syntax (recognising that the text from here to
  there is a possible reference, which we will need to resolve).
  Implementing Comment AST representation for unresolved references.
  Designing and implementing the XML representation for unresolved
  references.
- resolving links to decls within the TU. The result should probably
  be a Decl* or a USR. The USR should be available in the XML;
- defining a schema for a DB to store information about possible link
  targets (declarations and macros);
- populating DB with information from TUs in the project;
- resolving links to decls cross-TU using the DB. The result should
  be a USR, and maybe the source file name + source location.

Does this sound reasonable? What do you think?

This already looks like a lot of work, so I am not sure if actually
writing a tool that is going to produce HTML or LaTeX is going to fit
in... Maybe only a skeleton of such a tool.

Another important missing feature is attaching comments to macros.
Currently Clang can only attach comments to declarations.

Is this due to a limitation of the AST or just a feature that you
skipped in your first batch of -Wdocumentation? I always assumed it
shouldn't be too hard, but then that's probably just me being naive.

It just involves a completely different code path, through the
Preprocessor. I don't expect implementing it to be too hard, but it is
probably not trivial either, and will probably involve a lot of
plumbing through everywhere.

Information like inheritance relationships is omitted on purpose,
because it is expected that the client is using libclang and can query
the additional information as needed.

This would make it necessary that the XML carries the information to
build all the translation units that were used to generate the XML in
the first place, correct? Or is libclang able to deserialize the XML to
reconstruct the AST?

If the XML would already provide this information we would get looser
coupling, but it wouldn't be as flexible and every time new, helpful
information is discovered the XML would need to evolve.

Sorry, I did not explain clearly. Just to clear up any possible
misunderstandings:
- The XML format is only for comments, not for C, C++, or Objective-C
  ASTs.
- The XML format cannot be converted back into comment ASTs.

Currently clients already have a TranslationUnit when they query it
for the XML representation of the comment. XML is optimised for the
IDE use case, where the XML will be rendered into some rich text view
in the IDE. If the client needs extra information, it can query
it with very little overhead, because the TranslationUnit is already
in memory, and all the parsing and semantic analysis work was done.

OTOH, if we decide on a more offline approach, where comments in
XML format are stored after the TranslationUnit is destroyed, then we
either need to store more indexing info out-of-band, or add optional
pieces to the XML with that information.

The main question here seems to be how to represent the persistent AST:

- relational DB
- NoSQL
- graph DB

all seem like they could work and I don't have a clear idea how any
of them is going to perform.

Do you have any previous experience with databases, or a particular
preference? I guess that if we use a portable subset of sqlite, then
the tool would be able to run on a wide variety of systems, it would be
extremely easy to set up, and we would keep the possibility of using a
more heavyweight database in future if needed.
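
As a starting point for discussion, a link-target table could look
roughly like the Python/sqlite3 sketch below. The table name, the
columns, and the sample rows are assumptions made for this example and
not an agreed schema; the USR values in particular are placeholders
rather than real clang USRs.

```python
import sqlite3

# One row per possible link target (declaration or macro), keyed by USR.
SCHEMA = """
CREATE TABLE IF NOT EXISTS link_target (
    usr  TEXT PRIMARY KEY,  -- unique identifier of the target
    name TEXT NOT NULL,     -- qualified name as written in comments
    kind TEXT NOT NULL,     -- e.g. 'function', 'class', 'macro'
    file TEXT NOT NULL,     -- source file of the target
    line INTEGER NOT NULL   -- line within that file
);
CREATE INDEX IF NOT EXISTS link_target_name ON link_target(name);
"""


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def add_targets(db, targets):
    """Insert (usr, name, kind, file, line) rows gathered from one TU."""
    db.executemany(
        "INSERT OR REPLACE INTO link_target VALUES (?, ?, ?, ?, ?)", targets)
    db.commit()


def lookup(db, name):
    """Resolve a name cross-TU; several rows mean the name is ambiguous."""
    return db.execute(
        "SELECT usr, file, line FROM link_target WHERE name = ? ORDER BY usr",
        (name,)).fetchall()


db = open_db()
add_targets(db, [
    ("placeholder-usr-1", "geom::area", "function", "geom.h", 12),
    ("placeholder-usr-2", "GEOM_VERSION", "macro", "geom.h", 3),
])
print(lookup(db, "geom::area"))
print(lookup(db, "no::such::name"))
```

Only standard SQL types and a plain index are used, so the same schema
should carry over to a heavier database with little change.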

Updating the mapping can probably be done with different granularity
giving better performance on lower granularity but being harder to
implement.

IIRC there used to be design documents for clang-server somewhere on the
web. I'll look for them to get a clearer picture of the
requirements.

Feel free to ask me questions about this.

I'm absolutely not averse to working on this, but maybe we should focus
on first improving compatibility with Doxygen and comment parsing and
move into this topic later. There seems to be plenty of work in the
first stages already.

I completely agree, but resolving links cross-TU will require doing
some indexing of the source files. Certainly, in a Doxygen-related
GSoC we are not going to do any incremental indexing, and we are not
going to record more information than needed for this application, but
we should design the DB schema in a way that allows us to do this in
future.

Dmitri

Dear Dmitri, Philipp,

Regarding your (very interesting) project proposal, I have also been working on documentation using libclang. I have followed an approach that goes beyond Doxygen's capabilities.
I'd like to share with you my results so far.
http://jlopezvi.github.io/Flowgen/

Looking forward to your thoughts.
Best regards,

Juan Lopez-Villarejo

Dmitri Gribenko <gribozavr@gmail.com>
writes:

I'm also not familiar with llvm project politics and there is the
question how many slots llvm will get and how much promotion and
"lobbying" is necessary to make this thing happen. I was thinking about
cross-posting my mail to clang-dev to llvm-dev as well and see how much
of a response there is.

I don't think there is much politics involved.

That's great. I'll submit a proposal based on this mail exchange to
Melange today and will keep improving it as we go further.

For a good application I would like to define a certain set of
milestones we want to achieve. If you have anything specific in mind,
please let me know.

Based on the discussion so far, I think this could be used as a draft plan:

- attaching comments to macros;
- parsing the reference syntax (recognising that the text from here to
  there is a possible reference, which we will need to resolve).
  Implementing Comment AST representation for unresolved references.
  Designing and implementing the XML representation for unresolved
  references.
- resolving links to decls within the TU. The result should probably
  be a Decl* or a USR. The USR should be available in the XML;
- defining a schema for a DB to store information about possible link
  targets (declarations and macros);
- populating DB with information from TUs in the project;
- resolving links to decls cross-TU using the DB. The result should
  be a USR, and maybe the source file name + source location.

Does this sound reasonable? What do you think?

The first three stages seem very self-contained and we can probably add
them independently.

Designing the database for link resolution seems to be the biggest
challenge especially since it should be future-proof as you mention in
your last comment. I'll be working on designing some preliminary schema
so there is something more substantial to discuss.

This already looks like a lot of work, so I am not sure if actually
writing a tool that is going to produce HTML or LaTeX is going to fit
in... Maybe only a skeleton of such a tool.

I agree. The proposal I'll upload talks about a very basic HTML and
possibly a LaTeX generator to outline how the functionality can be used
to build a more general purpose tool.

2 Prior Work
════════════

• clang already understands doxygen-style comments to a degree and
  attaches them to the ast:
  [http://llvm.org/devmtg/2012-11/Gribenko_CommentParsing.pdf\]
• doxygen can already use clang as a backend
  [http://comments.gmane.org/gmane.comp.compilers.clang.devel/29490\]
• there already is a cldoc [https://github.com/jessevdk/cldoc\]

3 Project Plan
══════════════

3.1 Fully parse doxygen comments
────────────────────────────────

Doxygen supports markdown, HTML entities, if/endif, post-definition
documentation, file scope doc, function groups, member groups, pages,
page hierarchies, examples, links, auto-links, and todo/bug lists (the
dreaded xrefitem).

Some of those features might seem like overkill but they usually ended
up in doxygen because someone wanted them and they are actively used
in "the real world" (c).

The CommentParser should do its best to represent those in a useful
fashion in the CommentAST (especially link resolving) so tools further
down the chain can focus on their tasks only.

Link resolving is pretty important when generating self-contained
documentation files. Clang does not attempt to resolve links right
now. I expect that implementing this will need a significant time
investment.

Yes, the way doxygen approaches linking is far from trivial
(auto-linking, namespace guessing, sometimes it considers scope) and
I would allocate a decent amount of time for this.

There also would need to be some way to defer linking when external
projects are involved (possibly marking certain chunks of a comment as
linkable, but unresolved).
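
To make the "linkable, but unresolved" idea concrete, here is a minimal sketch in Python rather than in Clang's C++. It is not Clang's comment parser and the class names are hypothetical. It only recognises Doxygen's explicit \ref / @ref command; auto-linking and scope guessing are deliberately left out. Each reference is kept as its own chunk with an empty target that a later pass (within the TU, or against an external project) could fill in.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Text:
    """Plain comment text that is not a link candidate."""
    value: str


@dataclass
class Reference:
    """A chunk that looks like a link (hypothetical node type)."""
    spelling: str                 # the name exactly as written in the comment
    target: Optional[str] = None  # None means "linkable, but unresolved"


# Explicit \ref or @ref followed by a (possibly qualified) name.
_REF = re.compile(r"[\\@]ref\s+([A-Za-z_][\w:]*)")


def split_references(comment: str) -> List[Union[Text, Reference]]:
    """Split a comment into text chunks and unresolved reference chunks."""
    chunks: List[Union[Text, Reference]] = []
    pos = 0
    for m in _REF.finditer(comment):
        if m.start() > pos:
            chunks.append(Text(comment[pos:m.start()]))
        chunks.append(Reference(m.group(1)))
        pos = m.end()
    if pos < len(comment):
        chunks.append(Text(comment[pos:]))
    return chunks
```

A resolver would then walk the chunks and set `target` where it can, leaving the rest untouched for a later, cross-project pass.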

Another important missing feature is attaching comments to macros.
Currently Clang can only attach comments to declarations.

Is this due to a limitation of the AST or just a feature that you
skipped in your first batch of -Wdocumentation? I always assumed it
shouldn't be too hard, but then that's probably just me being naive.

It just involves a completely different code path, through
Preprocessor. I don't expect implementing it to be too hard, but
probably not trivial either, and probably involving a lot of plumbing
through everywhere.

Sounds like a perfect first task to tackle.

Information like inheritance relationships is omitted on purpose,
because it is expected that the client is using libclang and can query
the additional information as needed.

This would make it necessary that the XML carries the information to
build all the translation units that were used to generate the XML in
the first place, correct? Or is libclang able to deserialize the XML to
reconstruct the AST?

If the XML already provided this information we would get looser
coupling, but it wouldn't be as flexible, and every time new, helpful
information is discovered the XML would need to evolve.

Sorry, I did not explain clearly. Just to clear up any possible misunderstandings:
- XML format is only for comments, not C, C++, Objective-C ASTs.
- XML format is not reversible to comment ASTs.

Currently clients already have a TranslationUnit when they query it
for the XML representation of the comment. XML is optimised for the
IDE use case, where the XML will be rendered into some rich text view
in the IDE. If the client needs extra information, it can query
it with very little overhead, because the TranslationUnit is already
in memory, and all the parsing and semantic analysis work was done.

OTOH, if we decide on a more offline approach, where comments in
XML format are stored after the TranslationUnit is destroyed, then we
either need to store more indexing info out-of-band, or add optional
pieces to the XML with that information.

Thanks, I was under the impression that the XML should represent at
least a subset of the AST and that the XML should be the sole input to a
documentation generator. This would obviously require it to contain much
more information.

3.3.3 Database + Web-server
╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌

A special case for HTML. Provide a database and a web-frontend that
can be hosted. Seems interesting for fast search functions and live
documentation updates. clang-server where are you?

This looks like a very promising approach that does not just provide
the same functionality as Doxygen does, but introduces new value.
This can actually become the foundation for the clang-server itself!
The basic functionality for live updates -- tracking dependencies
between source files, indexing and reindexing will be useful for both
documentation server and clang-server.

The main question here seems to be how to represent the persistent AST:

- relational DB
- NoSQL
- graph DB

all seem like they could work and I don't have a clear idea how any
of them is going to perform.

Do you have any previous experience with databases, or a particular
preference? I guess that if we use a portable subset of sqlite, then
the tool would be able to run on a wide variety of systems, make it
extremely easy to set up the tool, and leave a possibility of using a
more heavyweight database in future if needed.

Most of my database work has been with sqlite and it seems the most
portable of all the options and is also the least hassle for users.

I'll allocate some time in the schema design phase of the database to
research some alternatives more closely.
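
As a starting point for that discussion, a very rough sketch of what an sqlite table of link targets could look like, driven from Python's standard sqlite3 module. The table name, column names and helper functions are all assumptions for illustration. Nothing here is an agreed schema, and the USR string in the usage below is only an example value.

```python
import sqlite3

# Hypothetical schema: one row per possible link target.
SCHEMA = """
CREATE TABLE IF NOT EXISTS link_target (
    usr  TEXT PRIMARY KEY,  -- USR identifying the declaration or macro
    name TEXT NOT NULL,     -- name as a comment might spell it
    kind TEXT NOT NULL,     -- 'decl' or 'macro'
    file TEXT NOT NULL,     -- source file name
    line INTEGER NOT NULL,  -- source location
    col  INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS link_target_name ON link_target(name);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the schema exists."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def add_target(db, usr, name, kind, file, line, col):
    """Record one link target; re-indexing a TU overwrites its old rows."""
    db.execute(
        "INSERT OR REPLACE INTO link_target VALUES (?, ?, ?, ?, ?, ?)",
        (usr, name, kind, file, line, col),
    )


def resolve(db, name):
    """Return (usr, file, line, col) for every target spelled `name`."""
    return db.execute(
        "SELECT usr, file, line, col FROM link_target "
        "WHERE name = ? ORDER BY usr",
        (name,),
    ).fetchall()
```

Usage would be along the lines of `db = open_db()`, then `add_target(db, "c:@F@foo", "foo", "decl", "foo.h", 3, 5)` and `resolve(db, "foo")`. Exact-name lookup is of course far simpler than what Doxygen-style linking needs; the point is only that the result of a cross-TU lookup is a USR plus file name and location.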

Dmitri Gribenko <gribozavr@gmail.com>
writes:

For a good application I would like to define a certain set of
milestones we want to achieve. If you have anything specific in mind,
please let me know.

Based on the discussion so far, I think this could be used as a draft plan:

- attaching comments to macros;
- parsing the reference syntax (recognising that the text from here to
  there is a possible reference, which we will need to resolve).
  Implementing Comment AST representation for unresolved references.
  Designing and implementing the XML representation for unresolved
  references.
- resolving links to decls within the TU. The result should probably
  be a Decl* or a USR. The USR should be available in the XML;
- defining a schema for a DB to store information about possible link
  targets (declarations and macros);
- populating DB with information from TUs in the project;
- resolving links to decls cross-TU using the DB. The result should
  be a USR, and maybe the source file name + source location.

Does this sound reasonable? What do you think?

The first three stages seem very self-contained and we can probably add
them independently.

Maybe only the first two stages? In order to resolve links, we should
have parsed something first...

This already looks like a lot of work, so I am not sure if actually
writing a tool that is going to produce HTML or LaTeX is going to fit
in... Maybe only a skeleton of such a tool.

I agree. The proposal I'll upload talks about a very basic HTML and
possibly a LaTeX generator to outline how the functionality can be used
to build a more general purpose tool.

Sounds good.

It just involves a completely different code path, through
Preprocessor. I don't expect implementing it to be too hard, but
probably not trivial either, and probably involving a lot of plumbing
through everywhere.

Sounds like a perfect first task to tackle.

I agree. How much experience do you have with the Clang codebase?

Sorry, I did not explain clearly. Just to clear up any possible misunderstandings:
- XML format is only for comments, not C, C++, Objective-C ASTs.
- XML format is not reversible to comment ASTs.

Currently clients already have a TranslationUnit when they query it
for the XML representation of the comment. XML is optimised for the
IDE use case, where the XML will be rendered into some rich text view
in the IDE. If the client needs extra information, it can query
it with very little overhead, because the TranslationUnit is already
in memory, and all the parsing and semantic analysis work was done.

OTOH, if we decide on a more offline approach, where comments in
XML format are stored after the TranslationUnit is destroyed, then we
either need to store more indexing info out-of-band, or add optional
pieces to the XML with that information.

Thanks, I was under the impression that the XML should represent at
least a subset of the AST and that the XML should be the sole input to a
documentation generator. This would obviously require it to contain much
more information.

I think an indexing DB would be a more promising approach. If we
stuff everything into XML, then certainly, that information can be
used by the documentation tool, but probably not by other tools.
OTOH, if we have a DB with a more-or-less extensible schema, we can
eventually build other tools to assist editors (go to definition, find
usages etc.)

3.3.3 Database + Web-server
╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌

A special case for HTML. Provide a database and a web-frontend that
can be hosted. Seems interesting for fast search functions and live
documentation updates. clang-server where are you?

This looks like a very promising approach that does not just provide
the same functionality as Doxygen does, but introduces new value.
This can actually become the foundation for the clang-server itself!
The basic functionality for live updates -- tracking dependencies
between source files, indexing and reindexing will be useful for both
documentation server and clang-server.

The main question here seems to be how to represent the persistent AST:

- relational DB
- NoSQL
- graph DB

all seem like they could work and I don't have a clear idea how any
of them is going to perform.

Do you have any previous experience with databases, or a particular
preference? I guess that if we use a portable subset of sqlite, then
the tool would be able to run on a wide variety of systems, make it
extremely easy to set up the tool, and leave a possibility of using a
more heavyweight database in future if needed.

Most of my database work has been with sqlite and it seems the most
portable of all the options and is also the least hassle for users.

I'll allocate some time in the schema design phase of the database to
research some alternatives more closely.

Looking forward to future collaboration.

Dmitri

Dmitri Gribenko <gribozavr@gmail.com>
writes:

For a good application I would like to define a certain set of
milestones we want to achieve. If you have anything specific in mind,
please let me know.

Based on the discussion so far, I think this could be used as a draft plan:

- attaching comments to macros;
- parsing the reference syntax (recognising that the text from here to
  there is a possible reference, which we will need to resolve).
  Implementing Comment AST representation for unresolved references.
  Designing and implementing the XML representation for unresolved
  references.
- resolving links to decls within the TU. The result should probably
  be a Decl* or a USR. The USR should be available in the XML;

Hi Philipp,

I have discussed this proposal with Argyrios Kyrtzidis today, and we
agreed that it would be best to only include the above points plus the
following point

- defining exactly what kind of information we should retain while
  indexing the translation unit in order to be able to resolve (a yet
  unknown set of) links against it.

... as 'required' in the GSoC proposal, and the work related to DB
mentioned below should be 'opportunity'.

- defining a schema for a DB to store information about possible link
  targets (declarations and macros);
- populating DB with information from TUs in the project;
- resolving links to decls cross-TU using the DB. The result should
  be a USR, and maybe the source file name + source location.

About the indexing, Argyrios suggested that it would be best to
decouple the process into two steps:

- indexing info should be produced as a separate file on disk (let's
  call the file foo.clangindex)
- the tool can consume *.clangindex files to populate the database.

This separation allows us in future to produce indexing information
during compilation as a sidecar file, so that if the user is building
the project anyway, we don't parse the source twice, once for indexing
and once for compilation.
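
The two steps could be sketched roughly like this in Python. This is an illustration only: the function names, the record fields and the table layout are assumptions, and JSON is used purely as a readable stand-in for whatever on-disk format is eventually chosen for the index files.

```python
import json
import sqlite3
from pathlib import Path


def write_index(path, targets):
    """Step 1: dump the link targets found in one TU to a sidecar file.

    `targets` is a list of dicts with the (assumed) keys
    usr, name, file and line.
    """
    Path(path).write_text(json.dumps(targets))


def populate(db, index_dir):
    """Step 2: load every *.clangindex file in index_dir into the DB.

    Returns the number of records read.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS link_target ("
        "usr TEXT PRIMARY KEY, name TEXT, file TEXT, line INTEGER)"
    )
    count = 0
    for index_file in sorted(Path(index_dir).glob("*.clangindex")):
        for t in json.loads(index_file.read_text()):
            db.execute(
                "INSERT OR REPLACE INTO link_target VALUES (?, ?, ?, ?)",
                (t["usr"], t["name"], t["file"], t["line"]),
            )
            count += 1
    return count
```

Because the producer only writes files and the consumer only reads them, step 1 could later move into the compiler itself without the database tool having to change.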

What do you think?

Dmitri

Dmitri Gribenko <gribozavr@gmail.com>
writes:

For a good application I would like to define a certain set of
milestones we want to achieve. If you have anything specific in mind,
please let me know.

Based on the discussion so far, I think this could be used as a draft plan:

  • attaching comments to macros;
  • parsing the reference syntax (recognising that the text from here to
    there is a possible reference, which we will need to resolve).
    Implementing Comment AST representation for unresolved references.
    Designing and implementing the XML representation for unresolved
    references.
  • resolving links to decls within the TU. The result should probably
    be a Decl* or a USR. The USR should be available in the XML;

Hi Philipp,

I have discussed this proposal with Argyrios Kyrtzidis today, and we
agreed that it would be best to only include the above points plus the
following point

  • defining exactly what kind of information we should retain while
    indexing the translation unit in order to be able to resolve (a yet
    unknown set of) links against it.

This seems really important. Linking is one of the quirkiest areas of Doxygen and getting it really right would be great.

… as ‘required’ in the GSoC proposal, and the work related to DB
mentioned below should be ‘opportunity’.

  • defining a schema for a DB to store information about possible link
    targets (declarations and macros);
  • populating DB with information from TUs in the project;
  • resolving links to decls cross-TU using the DB. The result should
    be a USR, and maybe the source file name + source location.

About the indexing, Argyrios suggested that it would be best to
decouple the process into two steps:

  • indexing info should be produced as a separate file on disk (let’s
    call the file foo.clangindex)
  • the tool can consume *.clangindex files to populate the database.

This separation allows us in future to produce indexing information
during compilation as a sidecar file, so that if the user is building
the project anyway, we don’t parse the source twice, once for indexing
and once for compilation.

Interesting. Is there some other part of clang that uses such files? How would I know to which code base to attach them? Or do we leave setting this up to the user? Is there a preference for certain data formats in clang?

Interesting. Is there some other part of clang that uses such files?

Not yet. Nothing described in the paragraph above exists now.

How would I know to which code base to attach them? Or do we leave
setting this up to the user?

Sorry, I don't understand the question here...

Is there a preference for certain data formats in clang?

In LLVM we usually prefer either stable future-proof text formats, or
binary files that are not guaranteed to be stable (because
forward/backward-compatibility for binary files is hard). For binary
files Clang and LLVM usually use the bitcode library from LLVM, which
allows one to encode data in a space-efficient way. Because these
indexing files are supposed to be transient, and the amount of
information we might need to save from an average C++ translation unit
can be enormous, I think a binary format would be the best.

Dmitri