[PATCH] libclang interface for comments AST

Hello,

The attached patch adds libclang APIs to walk comments ASTs and an API
to convert a comment to an HTML fragment.

I implemented error handling as returning bogus-but-safe values. I am
still concerned about this, because if the programmer misuses the API,
libclang will work around that silently. It could be a good idea to
add some opt-in debug output for libclang to scream in such cases.

For testing I implemented an equivalent of Comment::dump() with these
new APIs in c-index-test.

Please review.

Dmitri

libclang-comments-interface-v2.patch (63 KB)

I think it still would be better to return an invalid value where possible. For example, for functions returning CXCommentParamPassDirection, return CXCommentParamPassDirection_Null where no legal value can be return instead of CXCommentParamPassDirection_In.

This is what I think the functions should be returning to indicate an invalid value, depending on their type:

* enum - Do as above
* string - Empty or null string
* integers - Negative number, if possible
* pointers - Null
* structs - Usually contains an enum and/or pointer

* floating point - Perhaps NaN? Don't really know how that works in C/C++

* char - This really depends on what encoding is assumed. For 7bit ASCII and UTF-8 you probably can return anything above 127. Although I can't remember I've seen a libclang function returning a char so it might not be a problem.

Hello,

The attached patch adds libclang APIs to walk comments ASTs and an API
to convert a comment to an HTML fragment.

This is awesome. Some minor comments below, but I think this is ready to go in.

I implemented error handling as returning bogus-but-safe values. I am
still concerned about this, because if the programmer misuses the API,
libclang will work around that silently. It could be a good idea to
add some opt-in debug output for libclang to scream in such cases.

I think adding opt-in debug output for libclang is a wonderful (separable) idea.

For testing I implemented an equivalent of Comment::dump() with these
new APIs in c-index-test.

Great!

+void CommentASTToHTMLConverter::visitInlineCommandComment(
+ const InlineCommandComment *C) {
+ StringRef CommandName = C->getCommandName();
+ bool HasArg0 = C->getNumArgs() > 0 && !C->getArgText(0).empty();
+ StringRef Arg0;
+ if (HasArg0)
+ Arg0 = C->getArgText(0);

While you're working on comment parser, you may be interested by this bug:

http://llvm.org/bugs/show_bug.cgi?id=13411

This is a comment that make the comment parser crash (by assertion)

I'm really looking forward to this.

Fixed in r160568.

@code, according to Doxygen, is a starting command of a verbatim code block
(clearly not something that comment author intended -- seems like they didn't
use Doxygen).

The assertion was wrong in case we have a verbatim block without a closing
command.

Dmitri

Hello,

The attached patch adds libclang APIs to walk comments ASTs and an API
to convert a comment to an HTML fragment.

This is awesome. Some minor comments below, but I think this is ready to go in.

Committed r160577.

I implemented error handling as returning bogus-but-safe values. I am
still concerned about this, because if the programmer misuses the API,
libclang will work around that silently. It could be a good idea to
add some opt-in debug output for libclang to scream in such cases.

I think adding opt-in debug output for libclang is a wonderful (separable) idea.

OK, will think about it.

+void CommentASTToHTMLConverter::visitInlineCommandComment(
+ const InlineCommandComment *C) {
+ StringRef CommandName = C->getCommandName();
+ bool HasArg0 = C->getNumArgs() > 0 && !C->getArgText(0).empty();
+ StringRef Arg0;
+ if (HasArg0)
+ Arg0 = C->getArgText(0);
+
+ if (CommandName == "b") {
+ if (!HasArg0)
+ return;
+ Result += "<b>";
+ Result += Arg0;
+ Result += "</b>";
+ return;
+ }
+ if (CommandName == "c" || CommandName == "p") {
+ if (!HasArg0)
+ return;
+ Result += "<tt>";
+ Result += Arg0;
+ Result += "</tt>";
+ return;
+ }
+ if (CommandName == "a" || CommandName == "e" || CommandName == "em") {
+ if (!HasArg0)
+ return;
+ Result += "<em>";
+ Result += Arg0;
+ Result += "</em>";
+ return;
+ }

Here, we're classifying a subset of the Doxygen inline commands. Can you extract this operation out into a member function of InlineCommandComment, something like

        enum CXInlineCommandCommentKind {
                ICC_Unknown,
                ICC_Bold,
                ICC_Code,
                ICC_Emphasized
        };

        CXInlineCommandCommentKind getCommandCommentKind() const;

or maybe tie it to the rendering of the text, e.g.,

        enum CXInlineCommandRenderKind {
                ICR_Normal,
                ICR_Bold,
                ICR_Code,
                ICR_Emphasized
        }

        CXInlineCommandRenderKind getRenderKind() const;

so that all clients don't need to reinterpret the various Doxygen/HeaderDoc/etc. commands themselves (Unless they want to)? This information would be useful in the libclang API as well, since we expect many clients to use that.

Agreed. I like the second option better because Doxygen manual
assigns some semantic difference to the commands which are rendered
the same way. For example, \c is "anything that looks like code", but
\p is "parameter name".

I will do it as a follow-up because I think it makes sense to make
CommentSema responsible for such analysis and touching that is a
little out of scope for this patch.

+/**
+ * \brief Convert a given full parsed comment to an HTML fragment.
+ *
+ * Specific details of HTML layout are subject to change. Don't try to parse
+ * this HTML back into an AST, use other APIs instead.
+ *
+ * \param Comment a \c CXComment_FullComment AST node.
+ *
+ * \returns string containing an HTML fragment.
+ */
+CINDEX_LINKAGE CXString clang_FullComment_getAsHTML(CXComment Comment);

Somewhere, we'll need to document the various classes that the emitted HTML uses (para-brief, para-returns, etc.). I suspect that there might be specific interest in those classes, since that's a (currently hidden, but very important) part of this interface.

Added more comments to clang_FullComment_getAsHTML:

+ * Currently the following CSS classes are used:
+ * \li "para-brief" for \\brief paragraph and equivalent commands;
+ * \li "para-returns" for \\returns paragraph and equivalent commands;
+ * \li "word-returns" for the "Returns" word in \\returns paragraph.
+ *
+ * Argument list is rendered as \<dl\> list with arguments sorted in function
+ * prototype order.

Dmitri