libclang interface for comments AST

Hello,

I am designing a libclang interface to expose comments ASTs. Attached
is a patch for Index.h.

The API has node traversal APIs and APIs to get various properties of the nodes.

I am not sure about error handling. For example,
clang_InlineCommandComment_getCommandName is not applicable to a
CXComment_Text AST node. I think that this should abort() right away,
but I don't see any other libclang APIs doing that -- they try to
return some bogus value instead.

I just want to get some general feedback if this is the right
direction and if there is something obviously missing.

Dmitri

libclang-comments-interface.patch (9.68 KB)

This is awesome, just what I need.

About the error handling. It seems like most functions in libclang that returns a string will just return an empty string when it can't return a valid string.

BTW, why is CXComment a struct and not just a typedef? The rest of the libclang API seems to use typedefs to void* when there's only a void*. Some parts of the API, like CXString, uses structs but that's when they also contain some extra data in addition to the void*.

If you're going with the struct use all lowercase for the variable name(s) in the struct to be consistent with the rest of the API.

Hi Jacob!

Thank you for reviewing!

About the error handling. It seems like most functions in libclang that
returns a string will just return an empty string when it can't return a
valid string.

I agree that it is easy to make up a string with an empty value. But
what about clang_ParamCommandComment_getDirection, for example? There
is no "neutral" parameter passing direction.

BTW, why is CXComment a struct and not just a typedef? The rest of the
libclang API seems to use typedefs to void* when there's only a void*.
Some parts of the API, like CXString, uses structs but that's when they
also contain some extra data in addition to the void*.

The design is not final (and the API is not fully implemented), so I
didn't know if I will need anything else except void *Data.

If you're going with the struct use all lowercase for the variable
name(s) in the struct to be consistent with the rest of the API.

The API is inconsistent in itself: struct CXUnsavedFile, struct
CXVersion have members starting with uppercase letter, some other
structs use lowercase. Of these two styles I followed general clang
style (which is a better thing to do).

Dmitri

Hello,

I am designing a libclang interface to expose comments ASTs. Attached
is a patch for Index.h.

The API has node traversal APIs and APIs to get various properties of the nodes.

I am not sure about error handling. For example,
clang_InlineCommandComment_getCommandName is not applicable to a
CXComment_Text AST node. I think that this should abort() right away,
but I don't see any other libclang APIs doing that -- they try to
return some bogus value instead.

Yeah, we should return a safe-but-meaningless value from such mistakes. Clients of libclang aren't necessarily as rigorously tested as Clang itself, so flushing out bugs via assert()s or abort()s, the way we do in the LLVM code base, doesn't work well in practice.

I just want to get some general feedback if this is the right
direction and if there is something obviously missing.

/**
+ * \brief Given a cursor that represents a declaration, return the associated
+ * \c CXComment_FullComment AST node for the whole comment.
+ */
+CINDEX_LINKAGE CXComment clang_Cursor_getFullComment(CXCursor C);

Hi Doug!

Thank you for looking at this!

I am not sure about error handling. For example,
clang_InlineCommandComment_getCommandName is not applicable to a
CXComment_Text AST node. I think that this should abort() right away,
but I don't see any other libclang APIs doing that -- they try to
return some bogus value instead.

Yeah, we should return a safe-but-meaningless value from such mistakes. Clients of libclang aren't necessarily as rigorously tested as Clang itself, so flushing out bugs via assert()s or abort()s, the way we do in the LLVM code base, doesn't work well in practice.

OK.

/**
+ * \brief Given a cursor that represents a declaration, return the associated
+ * \c CXComment_FullComment AST node for the whole comment.
+ */
+CINDEX_LINKAGE CXComment clang_Cursor_getFullComment(CXCursor C);
+
+/**
  * @}
  */

Presumably, this should eventually work for macro definitions as well.

Changed comment to:

/**
* \brief Given a cursor that represents a documentable entity (e.g.,
* declaration), return the associated parsed comment as a
* \c CXComment_FullComment AST node.
*/

I think the name "full" comment is not specific enough. How about calling this a "parsed" comment, because that's what is actually returned? (And it's a good counter to the "raw" comment).

Renamed to clang_Cursor_getParsedComment.

+CINDEX_LINKAGE
+CXString clang_InlineCommandComment_getCommandName(CXComment Comment);
<snip>
+CINDEX_LINKAGE
+CXString clang_BlockCommandComment_getCommandName(CXComment Comment);

Why have both of these, rather than simply

CINDEX_LINKAGE
CXString clang_Comment_getCommandName(CXComment Comment);

?

Similar comment for clang_*Comment_getNumArgs/clang_*Comment_getArgText

So that this function would internally determine the type of the AST
node and "just do the right thing"? This, combined with ignoring AST
node type mismatches and making up "safe" values, might create some
issues with understanding the API and lead to API misuse.

Also, there are a number of clang_*Comment_getText() calls.

For example, someone might apply this clang_Comment_getText() call to
a FullComment and expect it to return a comment represented as plain
text (somehow, with or without HTML and Doxygen markup, word-wrapped
to fit 80 cols or not -- expectations can be very different). But,
unless we implement this AST traversal, we will just return an empty
value because we will call getText() only for AST nodes that have
getText() call themselves.

Or implementing getText() to work universally is the way to go?

Perhaps there should just be a single clang_Comment_getText() that works for the various comment kinds? The main outlier is

+/**
+ * \param Comment a \c CXComment_VerbatimBlockCommand AST node.
+ *
+ * \param LineIdx verbatim text line index (zero-based).
+ *
+ * \returns text contained in the specified line.
+ */
+CINDEX_LINKAGE
+CXString clang_VerbatimBlockComment_getText(CXComment Comment,
+ unsigned LineIdx);

and I'm wondering why one doesn't simply get the CXComment_VerbatimBlockLine via clang_Comment_getNumChildren/clang_Comment_getChild?

It is just a convenience function. It can be easily removed.

This group of functions:

+CINDEX_LINKAGE CXString clang_HTMLTagComment_getTagName(CXComment Comment);
+CINDEX_LINKAGE
+unsigned clang_HTMLStartTagComment_isSelfClosing(CXComment Comment);
+CINDEX_LINKAGE unsigned clang_HTMLStartTag_getNumAttrs(CXComment Comment);
+CINDEX_LINKAGE
+CXString clang_HTMLStartTag_getAttrName(CXComment Comment, unsigned AttrIdx);
+CINDEX_LINKAGE
+CXString clang_HTMLStartTag_getAttrValue(CXComment Comment, unsigned AttrIdx);

implies that a client will have to reconstruct HTML tags on its own, which seems like it's going to cause a lot of redundant (and probably wrong) code in clients. Can the aforementioned clang_Comment_getText() do the right thing for CXComment_HTMLStartTag and CXComment_HTMLEndTag nodes, by formatting the HTML tag appropriately? That would simplify a number of clients.

It surely makes sense to have such API, but see my concerns about
general getText() API.

Dmitri

Hi Doug!

Thank you for looking at this!

+CINDEX_LINKAGE
+CXString clang_InlineCommandComment_getCommandName(CXComment Comment);
<snip>
+CINDEX_LINKAGE
+CXString clang_BlockCommandComment_getCommandName(CXComment Comment);

Why have both of these, rather than simply

CINDEX_LINKAGE
CXString clang_Comment_getCommandName(CXComment Comment);

?

Similar comment for clang_*Comment_getNumArgs/clang_*Comment_getArgText

So that this function would internally determine the type of the AST
node and "just do the right thing"? This, combined with ignoring AST
node type mismatches and making up "safe" values, might create some
issues with understanding the API and lead to API misuse.

Perhaps.

Also, there are a number of clang_*Comment_getText() calls.

For example, someone might apply this clang_Comment_getText() call to
a FullComment and expect it to return a comment represented as plain
text (somehow, with or without HTML and Doxygen markup, word-wrapped
to fit 80 cols or not -- expectations can be very different). But,
unless we implement this AST traversal, we will just return an empty
value because we will call getText() only for AST nodes that have
getText() call themselves.

Or implementing getText() to work universally is the way to go?

I generally prefer fewer entry points, so that clients that have to switch() on all of the enum values can at the very least combine cases for obvious things (say, if they don't care about adding special formatting for HTML or other inlines). The general philosophy I have is that simple clients should be able to do simple things, but the information should be there to do more complicated things.

But, I completely see your point about this potentially causing confusion with the API, and we have seen some confusion with other libclang APIs that skew toward fewer, more general entry points. We can try it your way :slight_smile:

Perhaps there should just be a single clang_Comment_getText() that works for the various comment kinds? The main outlier is

+/**
+ * \param Comment a \c CXComment_VerbatimBlockCommand AST node.
+ *
+ * \param LineIdx verbatim text line index (zero-based).
+ *
+ * \returns text contained in the specified line.
+ */
+CINDEX_LINKAGE
+CXString clang_VerbatimBlockComment_getText(CXComment Comment,
+ unsigned LineIdx);

and I'm wondering why one doesn't simply get the CXComment_VerbatimBlockLine via clang_Comment_getNumChildren/clang_Comment_getChild?

It is just a convenience function. It can be easily removed.

Okay. I'd prefer removal, but would also be fine with commenting that these CXComment_VerbatimBlockLine comments will also be part of the children of the node, so users don't think they have to traverse both.

Aside: I notice that most of your comments for these APIs don't have any 'brief' paragraph; they simply have \param and \returns clauses.

This group of functions:

+CINDEX_LINKAGE CXString clang_HTMLTagComment_getTagName(CXComment Comment);
+CINDEX_LINKAGE
+unsigned clang_HTMLStartTagComment_isSelfClosing(CXComment Comment);
+CINDEX_LINKAGE unsigned clang_HTMLStartTag_getNumAttrs(CXComment Comment);
+CINDEX_LINKAGE
+CXString clang_HTMLStartTag_getAttrName(CXComment Comment, unsigned AttrIdx);
+CINDEX_LINKAGE
+CXString clang_HTMLStartTag_getAttrValue(CXComment Comment, unsigned AttrIdx);

implies that a client will have to reconstruct HTML tags on its own, which seems like it's going to cause a lot of redundant (and probably wrong) code in clients. Can the aforementioned clang_Comment_getText() do the right thing for CXComment_HTMLStartTag and CXComment_HTMLEndTag nodes, by formatting the HTML tag appropriately? That would simplify a number of clients.

It surely makes sense to have such API, but see my concerns about
general getText() API.

I'd at least like to have this API available.

  - Doug

I don't agree with this. Isn't the job of libclang to just provide the data and then it's up to the client what to do with it?

I wouldn't want to get HTML back, I'm just interested in the data. I have other plans for the data than transforming it to HTML.

What about providing API for doing both?

I agree that it is easy to make up a string with an empty value. But
what about clang_ParamCommandComment_getDirection, for example? There
is no "neutral" parameter passing direction.

Just add a new member to the enum representing an invalid value. Just as you've done with CXCommentKind. Call it CXCommentParamPassDirection_Null, CXCommentParamPassDirection_Invalid or something similar.

Another option, that always works, is to use an out parameter.

int clang_ParamCommandComment_getDirection(CXComment Comment, enum CXCommentParamPassDirection* out);

Or the other way around:

enum CXCommentParamPassDirection clang_ParamCommandComment_getDirection(CXComment Comment, int* success);

Although this doesn't really fit with the rest of the API.

The design is not final (and the API is not fully implemented), so I
didn't know if I will need anything else except void *Data.

Fair enough.

The API is inconsistent in itself: struct CXUnsavedFile, struct
CXVersion have members starting with uppercase letter, some other
structs use lowercase. Of these two styles I followed general clang
style (which is a better thing to do).

Dmitri

These are the only structs that have names starting with uppercase:

CXCompletionResult
CXCodeCompleteResults
CXUnsavedFile
CXVersion

BTW, it seems that libclang has its own style, which I see no problem with since it's C not C++.

I intended to suggest that we want both APIs.

  - Doug

Hi Doug!

Thank you for looking at this!

+CINDEX_LINKAGE
+CXString clang_InlineCommandComment_getCommandName(CXComment Comment);
<snip>
+CINDEX_LINKAGE
+CXString clang_BlockCommandComment_getCommandName(CXComment Comment);

Why have both of these, rather than simply

CINDEX_LINKAGE
CXString clang_Comment_getCommandName(CXComment Comment);

?

Similar comment for clang_*Comment_getNumArgs/clang_*Comment_getArgText

So that this function would internally determine the type of the AST
node and "just do the right thing"? This, combined with ignoring AST
node type mismatches and making up "safe" values, might create some
issues with understanding the API and lead to API misuse.

Perhaps.

Also, there are a number of clang_*Comment_getText() calls.

For example, someone might apply this clang_Comment_getText() call to
a FullComment and expect it to return a comment represented as plain
text (somehow, with or without HTML and Doxygen markup, word-wrapped
to fit 80 cols or not -- expectations can be very different). But,
unless we implement this AST traversal, we will just return an empty
value because we will call getText() only for AST nodes that have
getText() call themselves.

Or implementing getText() to work universally is the way to go?

I generally prefer fewer entry points, so that clients that have to switch() on all of the enum values can at the very least combine cases for obvious things (say, if they don't care about adding special formatting for HTML or other inlines). The general philosophy I have is that simple clients should be able to do simple things, but the information should be there to do more complicated things.

But, I completely see your point about this potentially causing confusion with the API, and we have seen some confusion with other libclang APIs that skew toward fewer, more general entry points. We can try it your way :slight_smile:

Don't get me wrong, I see how a general getText() call is a useful
API. But I think it would be better if it would be a separate API
call (for example, getAsPlainText()) that is designed (and expected by
users) to do all the required magic. I see these various *_getText()
APIs as property getters for specific node kinds, low-level building
blocks without any magic to facilitate building other tools on top of
libclang (for example, implement some other way to convert a comment
to plain text).

Perhaps there should just be a single clang_Comment_getText() that works for the various comment kinds? The main outlier is

+/**
+ * \param Comment a \c CXComment_VerbatimBlockCommand AST node.
+ *
+ * \param LineIdx verbatim text line index (zero-based).
+ *
+ * \returns text contained in the specified line.
+ */
+CINDEX_LINKAGE
+CXString clang_VerbatimBlockComment_getText(CXComment Comment,
+ unsigned LineIdx);

and I'm wondering why one doesn't simply get the CXComment_VerbatimBlockLine via clang_Comment_getNumChildren/clang_Comment_getChild?

It is just a convenience function. It can be easily removed.

Okay. I'd prefer removal, but would also be fine with commenting that these CXComment_VerbatimBlockLine comments will also be part of the children of the node, so users don't think they have to traverse both.

Removed to reduce confusion.

Aside: I notice that most of your comments for these APIs don't have any 'brief' paragraph; they simply have \param and \returns clauses.

\brief would be a copy of \returns -- these functions are just getters
for various node properties. I can rephrase \returns as \brief and
remove \returns.

This group of functions:

+CINDEX_LINKAGE CXString clang_HTMLTagComment_getTagName(CXComment Comment);
+CINDEX_LINKAGE
+unsigned clang_HTMLStartTagComment_isSelfClosing(CXComment Comment);
+CINDEX_LINKAGE unsigned clang_HTMLStartTag_getNumAttrs(CXComment Comment);
+CINDEX_LINKAGE
+CXString clang_HTMLStartTag_getAttrName(CXComment Comment, unsigned AttrIdx);
+CINDEX_LINKAGE
+CXString clang_HTMLStartTag_getAttrValue(CXComment Comment, unsigned AttrIdx);

implies that a client will have to reconstruct HTML tags on its own, which seems like it's going to cause a lot of redundant (and probably wrong) code in clients. Can the aforementioned clang_Comment_getText() do the right thing for CXComment_HTMLStartTag and CXComment_HTMLEndTag nodes, by formatting the HTML tag appropriately? That would simplify a number of clients.

It surely makes sense to have such API, but see my concerns about
general getText() API.

I'd at least like to have this API available.

Of course, just with a separate function name.

Dmitri

Sure, both APIs should be available.

Dmitri

Okay, that makes sense. I don't think we need to do getAsPlainText() now.

  - Doug

I agree that it is easy to make up a string with an empty value. But
what about clang_ParamCommandComment_getDirection, for example? There
is no "neutral" parameter passing direction.

Just add a new member to the enum representing an invalid value. Just as
you've done with CXCommentKind. Call it
CXCommentParamPassDirection_Null, CXCommentParamPassDirection_Invalid or
something similar.

Another option, that always works, is to use an out parameter.

int clang_ParamCommandComment_getDirection(CXComment Comment, enum
CXCommentParamPassDirection* out);

Or the other way around:

enum CXCommentParamPassDirection
clang_ParamCommandComment_getDirection(CXComment Comment, int* success);

Although this doesn't really fit with the rest of the API.

Or simply default to CXCommentParamPassDirection_In and not worry about broken clients getting safe (but meaningless) information.

The API is inconsistent in itself: struct CXUnsavedFile, struct
CXVersion have members starting with uppercase letter, some other
structs use lowercase. Of these two styles I followed general clang
style (which is a better thing to do).

Dmitri

These are the only structs that have names starting with uppercase:

CXCompletionResult
CXCodeCompleteResults
CXUnsavedFile
CXVersion

BTW, it seems that libclang has its own style, which I see no problem
with since it's C not C++.

I think it makes sense to follow LLVM conventions going forward, even though some of the older structs in libclang don't follow those conventions.

  - Doug