Documenting Clang: question about how best to deliver the doc

As some of you know, we at Sony Computer Entertainment America have been working on various aspects of LLVM, including Clang and its toolchain. As part of our work, we have created documentation for our customers about using Clang, and we would like to share the fruits of our work with the Clang and LLVM communities.

As our first documentation submission, we plan to provide our CPU Intrinsics Guide, which documents the Clang intrinsics for x86intrin.h, along with several builtin and sync types. I’ve included a sample of what we document for one of the intrinsics below.

Our question for the community is: what documentation format is most helpful and desired for this information? We currently have two main possibilities in mind (with three variants for the first option):

  1. Add the documentation for each intrinsic to the header file:
  • 1a) Using Doxygen tagging. One benefit of this approach is that the documentation is available for the developer within a code-development/editing system. One potential difficulty with this approach is that the intrinsics header file becomes much larger, which could increase compile times.

  • 1b) Using Microsoft’s annotation grammar. We might be able to contain this annotation grammar within Doxygen tagging that deviates somewhat from the LLVM Doxygen style. This approach allows us to generate XML output for the Microsoft Visual Studio Tooltip class. The benefit of this approach is that the documentation is available for the developer within Visual Studio, without his or her having to open the specific header file. Like option (1a), one potential difficulty with this approach is that the intrinsics header file becomes much larger, which could increase compile times.

  • 1c) Using TableGen to maintain both the intrinsics definitions and their documentation, from which we generate the header file with both. With this approach, we could implement either option (1a), (1b), or both, and have a single point of maintenance. This option has the same benefits and drawbacks as (1a) and (1b).

  2. Add the documentation in reST and Sphinx format (to match existing Clang and LLVM documentation) to the Clang Web site. The main benefit of this approach is that the documentation is available to anyone on the Web.

Thus, we come to you today to ask your opinion on which approach we should take. We’re open to providing one or more of the formats, as desired, or considering a different option that one of you might suggest.

Sample intrinsic documentation (ASCII formatted for forum viewing)

> 1) Add the documentation for each intrinsic to the header file:
>
> - 1a) Using Doxygen tagging. One benefit of this approach is that the
> documentation is available for the developer within a
> code-development/editing system. One potential difficulty with this
> approach is that the intrinsics header file becomes much larger, which
> could increase compile times.

Would you mind measuring this, at least approximately? If the performance
penalty isn't substantial, I would lean towards this solution.

> - 1b) Using Microsoft's annotation grammar. We might be able to contain
> this annotation grammar within Doxygen tagging that deviates somewhat from
> the LLVM Doxygen style. This approach allows us to generate XML output for
> the Microsoft Visual Studio Tooltip class. The benefit of this approach is
> that the documentation is available for the developer within Visual Studio,
> without his or her having to open the specific header file. Like option
> (1a), one potential difficulty with this approach is that the intrinsics
> header file becomes much larger, which could increase compile times.

It's not clear to me why you can't generate XML output if you use LLVM
Doxygen style; do you just not have a tool that can do the appropriate
conversion?

> - 1c) Using TableGen to maintain both the intrinsics definitions and their
> documentation, from which we generate the header file with both. With this
> approach, we could implement either option (1a), (1b), or both, and have a
> single point of maintenance. This option has the same benefits and
> drawbacks as (1a) and (1b).

Not really a fan of this; the intrinsics headers are hard enough to read
without using some custom format.

> 2) Add the documentation in reST and Sphinx format (to match existing Clang
> and LLVM documentation) to the Clang Web site. The main benefit of this
> approach is that the documentation is available to anyone on the Web.

If we have the docs in Doxygen format, we can convert them to HTML anyway.


Sample intrinsic documentation (ASCII formatted for forum viewing)

-------------------------------

_mm256_round_ps

SYNOPSIS

#include <x86intrin.h>

__m256 _mm256_round_ps(__m256 v, const int m);

INSTRUCTION

VROUNDPS

DESCRIPTION

Rounds the values stored in a packed 256-bit vector [8 x float] as
specified by the byte operand. The source values are rounded to integer
values and returned as floating point values.

PARAMETERS

v       A 256-bit vector of [8 x float] values.

m       An immediate byte operand specifying how the rounding is to
        be performed.

        Bits [7:4] are reserved.

        Bit [3] is a precision exception value:
                0: A normal PE exception is used
                1: The PE field is not updated

        Bit [2] is a rounding control source:
                0: Use the RC field value in bits [1:0]
                1: Use MXCSR.RC

        Bits [1:0] contain the rounding control definition:
                00: Nearest
                01: Downward (toward negative infinity)
                10: Upward (toward positive infinity)
                11: Truncated

RETURNS

A 256-bit vector of [8 x float] containing the rounded values.

-------------------------------

Very nice. :)

-Eli
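
The bit layout of the `m` operand in the sample documentation above can be sketched in C as a set of masks plus a small decoder. This is only an illustration of the documented fields: the enum and function names are hypothetical and are not defined by x86intrin.h, and the sketch deliberately does not interpret bit [2].

```c
#include <assert.h>

/* Masks for the fields of the immediate byte operand `m`, following the
 * bit ranges listed in the sample documentation. The names are
 * hypothetical and are not defined by x86intrin.h. */
enum {
    ROUND_RC_MASK     = 0x03, /* bits [1:0]: rounding control definition */
    ROUND_RC_SRC_BIT  = 0x04, /* bit [2]: rounding control source */
    ROUND_PE_BIT      = 0x08, /* bit [3]: precision exception value */
    ROUND_RESERVED    = 0xF0  /* bits [7:4]: reserved */
};

/* Names the rounding control definition held in bits [1:0] of m. */
static const char *rounding_control_name(int m)
{
    static const char *const names[4] = {
        "Nearest", "Downward", "Upward", "Truncated"
    };
    assert(m >= 0 && m <= 0xFF); /* the operand is a single byte */
    return names[m & ROUND_RC_MASK];
}

/* Bit [3] clear means a normal PE exception is used. */
static int pe_field_is_updated(int m)
{
    return (m & ROUND_PE_BIT) == 0;
}

/* Bits [7:4] are reserved, so a well-formed operand leaves them clear. */
static int reserved_bits_are_clear(int m)
{
    return (m & ROUND_RESERVED) == 0;
}
```

For example, an operand of 0x0B has bits [1:0] set to 11 and bit [3] set, so it names the "Truncated" definition and leaves the PE field alone.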

As Eli mentioned, it would be nice to get some performance numbers.

Sorry for the delay in responding.

> Would you mind measuring this [increased size of the header file], at least approximately? If the performance penalty isn’t substantial, I would lean towards this solution.

We don’t have performance numbers yet. We can perform some measurements after the initial conversion from our base document to the new format is complete, so that we have real data in the header files to measure. We wouldn’t be able to simply measure the performance impact with fake data and -Wdocumentation (which parses Doxygen comments and generates warnings for malformed and mismatched comments).

> It’s not clear to me why you can’t generate XML output if you use LLVM Doxygen style; do you just not have a tool that can do the appropriate conversion?

We don’t have the appropriate tool yet, but we plan to create one. Our initial idea was to keep things in LLVM Doxygen style and then use a tool to convert to the Microsoft XML style. We expected the community to prefer 1a (straight Doxygen tagging) over 1b (Microsoft annotation grammar), but mentioned 1b for completeness’ sake. Using 1b would avoid the need for the extra conversion step (LLVM Doxygen style to XML), but this conversion is not necessarily difficult.

> Is generating MSVS Tooltip XML output a hard requirement for your use case?

Yes, generating the MSVS tooltip XML is a hard requirement for us.

> In what format is the documentation currently?

Currently, the documentation is in Microsoft Word format (required for our SDK). We have a tool that converts this document into an XML format that includes a lot of meaningless tags for Word styles. A new tool would convert to a more useful XML format.

> If so, can you estimate how much effort it is to convert to that format given each of these options?

For any of the proposed solutions, we will need to create a script to do an initial one-off conversion to the new format. After that, we estimate that solutions 1a and 1b each require a roughly equal, relatively small effort. Solution 1c requires a larger effort because it changes how the intrinsics headers are generated and maintained. Solution 2 is fairly simple because much of its work is already covered by the one-off initial conversion, which we need in any case.

Our main concern is picking the path forward that is most preferred by the community. Given the discussion thus far, it sounds like the two front-runners are:

1a) Maintain the documentation in Doxygen. Any other formats required (HTML, XML, etc) can be generated on-the-fly from the Doxygen comments as needed.

2) Maintain the documentation in reST/Sphinx. Any other formats required can be generated on-the-fly from the reST as needed. (Also, it seems that Sphinx has good plugins and support for converting to other formats.)

Cheers,

Michael

If you decide to go with Doxygen, make sure to check that it supports
all formatting you need. In particular, check how tables, nested
lists, and inline monospaced code work.

Dmitri

> > Our main concern is picking the path forward that is most preferred by the
> > community.
>
> If you decide to go with Doxygen, make sure to check that it supports
> all formatting you need. In particular, check how tables, nested
> lists, and inline monospaced code work.

There is motion in the doxygen community to move toward a Markdown based
syntax for more general purpose formatting. It seems very likely to remove
most of the problems with these features.

> Our main concern is picking the path forward that is most preferred by
> the community. Given the discussion thus far, it sounds like the two
> front-runners are:
>
> 1a) Maintain the documentation in Doxygen. Any other formats required
> (HTML, XML, etc) can be generated on-the-fly from the Doxygen comments as
> needed.

This has the advantage that -Wdocumentation will ensure that the docs and
prototypes/argument lists are in sync; I'm not sure how significant that is
since the code in these headers is mostly "write only" and not subject to
continual maintenance and evolution (at least, not to the extent of
"regular code").

Dmitri's point about formatting bears further investigation. You may want
to do a similar check for reST (although reST is quite rich already, so
hopefully nothing will be missing); however, reST is extensible through
Sphinx plugins (e.g. custom directives/roles), so there will always be a way
to work around any particular issue.

One concern for both approaches is how much code will be necessary to
translate Doxygen/Sphinx formatted text into the desired output format.
Will a simple "map this XML element to this other one" table be enough, or
will it require a lot of nasty hand-coded stuff ("for ordered lists do
this, for unordered lists do that, for italic do some other thing, for
monospace do yet another thing, etc")?
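
To make the "simple table" case concrete, here is a minimal sketch in C of a lookup that renames elements one-for-one and reports which ones fall outside the table. The element names in both columns are assumptions chosen for illustration; they have not been checked against the Doxygen XML schema or the Microsoft tooltip format, and `map_element` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the "map this XML element to this other one" table. */
struct element_map {
    const char *source; /* element name in the source XML */
    const char *target; /* element name in the output XML */
};

/* Assumed names, for illustration only; check both schemas before use. */
static const struct element_map simple_renames[] = {
    { "para",           "para" },
    { "computeroutput", "c"    },
};

/* Returns the output element for a simple rename, or NULL when the element
 * is not in the table and therefore needs hand-written handling. */
static const char *map_element(const char *source)
{
    size_t i;
    assert(source != NULL);
    for (i = 0; i < sizeof simple_renames / sizeof simple_renames[0]; i++) {
        if (strcmp(simple_renames[i].source, source) == 0)
            return simple_renames[i].target;
    }
    return NULL;
}
```

Every element for which `map_element` returns NULL is a candidate for the hand-coded handling described above, so counting those cases on real input gives a rough measure of the translation effort.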

> 2) Maintain the documentation in reST/Sphinx. Any other formats
> required can be generated on-the-fly from the reST as needed. (Also, it
> seems that Sphinx has good plugins and support for converting to other
> formats.)

Having worked on the Sphinx docs a lot, I have a fairly good idea of the
extent of what Sphinx can do, and I can envision an easy-to-maintain way to
handle these docs. Using reST "field lists"
<http://docutils.sourceforge.net/docs/user/rst/quickref.html#field-lists>
and other reST features, you could capture it in a "highly semantic" form
like this:

:Header: x86intrin.h
:Prototype: __m256 _mm256_round_ps(__m256 v, const int m);
:Instruction: VROUNDPS
:Description:
   Rounds the values stored in a packed 256-bit vector [8 x float] as
   specified by the byte operand. The source values are rounded to integer
   values and returned as floating point values.
:Parameters:
   v
      A 256-bit vector of [8 x float] values.
   m
      An immediate byte operand specifying how the rounding is to be
      performed.
      Bits [7:4] are reserved.
      Bit [3] is a precision exception value:
                      0: A normal PE exception is used
                      1: The PE field is not updated
      Bit [2] is a rounding control source:
                      0: Use the RC field value in bits [1:0]
                      1: Use MXCSR.RC
      Bits [1:0] contain the rounding control definition:
                      00: Nearest
                      01: Downward (toward negative infinity)
                      10: Upward (toward positive infinity)
                      11: Truncated
:Returns:
   A 256-bit vector of [8 x float] containing the rounded values.

Most of the structure in the above snippet will be represented in the
docutils XML representation. I'm not very familiar with Doxygen markup, but
my understanding is that it wouldn't be possible to semantically capture
the equivalent of the :Header: field (unless something like that is already
hardcoded into doxygen). I use the term "semantically" to indicate that
it's not just a piece of text marked up to be formatted in a specific way,
but instead associates a specific semantic intent to the text, which can
then be programmatically extracted/manipulated.
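
As a rough illustration of what "programmatically extracted" could mean for a field list, the sketch below splits a single `:Name: body` line into its two parts in C. It is a hypothetical helper written for this thread, not part of docutils or Sphinx, which do the real parsing; it assumes the line has no trailing newline and ignores indented continuation lines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits one reST field-list line of the form ":Name: body" into its name
 * and body. Returns 1 on success and 0 if the line is not a field marker
 * or does not fit in the supplied buffers. */
static int parse_field(const char *line, char *name, size_t name_cap,
                       char *body, size_t body_cap)
{
    const char *end;
    size_t name_len, body_len;

    assert(line != NULL && name != NULL && body != NULL);
    if (line[0] != ':')
        return 0;

    /* The marker ends at ": " when the body is on the same line ... */
    end = strstr(line + 1, ": ");
    if (end == NULL) {
        /* ... or at a trailing ':' when the body starts on the next line. */
        size_t len = strlen(line);
        if (len < 3 || line[len - 1] != ':')
            return 0;
        end = line + len - 1;
    }

    name_len = (size_t)(end - (line + 1));
    if (name_len == 0 || name_len >= name_cap)
        return 0;
    memcpy(name, line + 1, name_len);
    name[name_len] = '\0';

    end++;                      /* skip the closing ':' */
    while (*end == ' ')         /* skip the separator spaces */
        end++;
    body_len = strlen(end);
    if (body_len >= body_cap)
        return 0;
    memcpy(body, end, body_len + 1);
    return 1;
}
```

Given the line `:Instruction: VROUNDPS`, the helper yields the name `Instruction` and the body `VROUNDPS`; given `:Description:`, it yields the name with an empty body, leaving the indented text that follows to the caller.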

I'm not sure how important a "semantic" representation is though. What sort
of information needs to be emitted for the Microsoft tooltip format? Is it
just a blob of arbitrary formatted text, or does it actually require
separating things like "which header should be included to get this
definition"?

-- Sean Silva

Just to add another possibility if you decide to use Doxygen-style comments.

Doug Gregor and others at Boost have developed a Doxygen interface from Boost's Quickbook to provide
a C++ reference section containing all the info in the Doxygen comments

\tparam, \param, \returns ...

Quickbook gives you most of the format control, automated hyperlinks, syntax colouring ... in your
textual documentation like tutorials, that you are likely to want.

Snippets of code can be included, so you can ensure that the code examples in the docs have been
compiled and run.

Lots of the biggest and newest libraries in Boost use this system and I think they look very nice.

So providing the Doxygen comments are there (the big task?), you can process them downstream in more
than one way.

HTH

Paul

Thanks to everyone on this thread who replied to our query about what approach to take for our Clang documentation. Your thoughtful comments were very helpful to our understanding of the issues and of the preferences of the community. We’ve reviewed your responses to the options that we proposed, and we’ve decided to pursue the following option:

1a) Maintain the documentation in Doxygen. Any other formats required (HTML, XML, reST/Sphinx, and so on) can be generated on-the-fly from the Doxygen comments as needed.

We have a good idea about how to proceed with this option, and we expect that we can begin submitting our updated Doxygen documentation later this summer.
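
For readers unfamiliar with the chosen format, here is a minimal sketch of what a Doxygen-tagged function could look like, using the `\brief`, `\param`, and `\returns` commands, a nested list, and inline monospace via `\c`. The `///` comment style is assumed, and `make_round_operand` is a hypothetical helper that is not part of x86intrin.h; it is used so that the example compiles on any target and is not taken from the submitted documentation.

```c
#include <assert.h>

/// \brief Packs the fields of the immediate byte operand taken by
///    \c _mm256_round_ps into a single value.
///
/// \param rc
///    The rounding control definition, placed in bits [1:0]:
///    - 0: Nearest
///    - 1: Downward (toward negative infinity)
///    - 2: Upward (toward positive infinity)
///    - 3: Truncated
/// \param rc_source
///    The rounding control source (0 or 1), placed in bit [2].
/// \param suppress_pe
///    The precision exception value (0 or 1), placed in bit [3].
/// \returns The operand byte. The reserved bits [7:4] are always zero.
static unsigned char make_round_operand(unsigned rc, unsigned rc_source,
                                        unsigned suppress_pe)
{
    assert(rc <= 3u);
    assert(rc_source <= 1u);
    assert(suppress_pe <= 1u);
    return (unsigned char)(rc | (rc_source << 2) | (suppress_pe << 3));
}
```

Because the parameter names appear both in the comment and in the signature, this is the kind of comment that -Wdocumentation can check against the prototype.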

Cheers,

Michael

This will be awesome. Thank you for all of your work!

-eric