[RFC] Removing libclang APIs to traverse the comment AST

Hello,

I'd like to remove libclang APIs to traverse the comment AST.
Instead, I would recommend to use clang_FullComment_getAsXML, which
provides the same information, but does not expose implementation
details of the comment AST.

Here is the list of libclang APIs I would like to remove:

enum CXCommentKind
enum CXCommentInlineCommandRenderKind
enum CXCommentParamPassDirection

clang_Comment_getKind
clang_Comment_getNumChildren
clang_Comment_getChild
clang_Comment_isWhitespace
clang_InlineContentComment_hasTrailingNewline
clang_TextComment_getText
clang_InlineCommandComment_getCommandName
clang_InlineCommandComment_getRenderKind
clang_InlineCommandComment_getNumArgs
clang_InlineCommandComment_getArgText
clang_HTMLTagComment_getTagName
clang_HTMLStartTagComment_isSelfClosing
clang_HTMLStartTag_getNumAttrs
clang_HTMLStartTag_getAttrName
clang_HTMLStartTag_getAttrValue
clang_BlockCommandComment_getCommandName
clang_BlockCommandComment_getNumArgs
clang_BlockCommandComment_getArgText
clang_BlockCommandComment_getParagraph
clang_BlockCommandComment_getCommentWithIndex
clang_ParamCommandComment_getParamName
clang_ParamCommandComment_isParamIndexValid
clang_ParamCommandComment_getParamIndex
clang_ParamCommandComment_isDirectionExplicit
clang_ParamCommandComment_getDirection
clang_TParamCommandComment_getParamName
clang_TParamCommandComment_isParamPositionValid
clang_TParamCommandComment_getDepth
clang_TParamCommandComment_getIndex
clang_VerbatimBlockLineComment_getText
clang_VerbatimLineComment_getText

The reason to remove all these APIs is simple: providing these APIs
does not allow us to change the underlying AST easily.

Here's the plan:
Clang 3.4 -- the APIs are deprecated.
Clang 3.5 -- the APIs are removed.

Please let me know what you think.

Dmitri

Do we have no users? :frowning:

This is a huge API surface area that is being ripped out. The cries of even a single user that depends on this will forever shatter all confidence in LLVM’s C API stability promise. (also, btw, I wouldn’t expect a user of a stable API to be subscribed to the developer’s mailing list of the project…; that is a key difference between a stable API and a “no backwards compatibility; the onus is on you to be aware of what is happening in the codebase” API).

If these API’s are deprecated (later to be removed) please write a migration guide to help existing users transition to the new recommended approach (also, it would be nice to have some documentation about what the recommended approach even is!). Can you estimate the expected development cost required to transition from the old API to the recommended API? (for example, what if bunding an XML library is not feasible for a user that depends on this functionality?)

more comments below,

Do we have no users? :frowning:

This is a huge API surface area that is being ripped out. The cries of
even a single user that depends on this will forever shatter all confidence
in LLVM's C API stability promise. (also, btw, I wouldn't expect a user of
a stable API to be subscribed to the developer's mailing list of the
project...; that is a key difference between a stable API and a "no
backwards compatibility; the onus is on you to be aware of what is
happening in the codebase" API).

+1
I'm the user of that API... (yep, I'm subscribed to the dev list :slight_smile:

If these API's are deprecated (later to be removed) please write a
migration guide to help existing users transition to the new recommended
approach (also, it would be nice to have some documentation about what the
recommended approach even is!). Can you estimate the expected development
cost required to transition from the old API to the recommended API? (for
example, what if bunding an XML library is not feasible for a user that
depends on this functionality?)

agreed. personally I don't want to link w/ any XML lib... it was quite
enough (moreover, it is desired) to have an access to plain text
comments... I don't want any XML, cuz extracted comments get indexed for
further full-text search... and having XML here is completely harmful and
undesired...

I am not currently a user of the comment API, but I object to the general direction you appear to be taking. Maybe I've just been scarred by too many bad experiences with trying to parse badly thought out XML documents, but I'd *much* rather have a plain C API rather than an XML only one.

Also, given that this is a public API which has been communicated to be fairly stable, I'm concerned by the fact that you appear to be removing large chunks. Can you provide a concrete reason for wanting to remove these APIs? What actual change is it blocking?

p.s. Thanks for bringing these up on the llvmdev list rather than having them get lost in llvm-commits traffic.

Yours,
Philip

Do we have no users? :frowning:

This is a huge API surface area that is being ripped out. The cries of
even a single user that depends on this will forever shatter all confidence
in LLVM's C API stability promise. (also, btw, I wouldn't expect a user of a
stable API to be subscribed to the developer's mailing list of the
project...; that is a key difference between a stable API and a "no
backwards compatibility; the onus is on you to be aware of what is happening
in the codebase" API).

+1
I'm the user of that API... (yep, I'm subscribed to the dev list :slight_smile:

Hi Alex,

Thank you for a quick reply. It is unexpected for me that these APIs
have users.

If these API's are deprecated (later to be removed) please write a
migration guide to help existing users transition to the new recommended
approach (also, it would be nice to have some documentation about what the
recommended approach even is!). Can you estimate the expected development
cost required to transition from the old API to the recommended API? (for
example, what if bunding an XML library is not feasible for a user that
depends on this functionality?)

agreed. personally I don't want to link w/ any XML lib... it was quite
enough (moreover, it is desired) to have an access to plain text comments...
I don't want any XML, cuz extracted comments get indexed for further
full-text search... and having XML here is completely harmful and
undesired...

You can still extract plain text from XML pretty easily, without
linking to an XML library. While *parsing* XML requires a real
parser, stripping tags can be done with with a regexp. Or you could
use the C++ API. Does any of these work for you?

Dmitri

Hello,

I'd like to remove libclang APIs to traverse the comment AST.
Instead, I would recommend to use clang_FullComment_getAsXML, which
provides the same information, but does not expose implementation
details of the comment AST.

I am not currently a user of the comment API, but I object to the general
direction you appear to be taking. Maybe I've just been scarred by too many
bad experiences with trying to parse badly thought out XML documents, but
I'd *much* rather have a plain C API rather than an XML only one.

I hope that our comment XML API is well-designed. We have a schema,
we have a lot of tests, and we verify that all our output conforms to
the schema. Did you have a chance to take a look at the schema and
example output? Please tell me what you think and if anything looks
wrong.

Also, given that this is a public API which has been communicated to be
fairly stable, I'm concerned by the fact that you appear to be removing
large chunks.

Yes, this fact actually concerns me. This is why I started this
discussion on the mailing list.

The original reason to provide these APIs was to implement a testing
mode for comments in c-index-test. These APIs were not meant for
other uses, but I see that I failed to communicate that in the
documentation at all.

Can you provide a concrete reason for wanting to remove these
APIs? What actual change is it blocking?

I think this is a separate issue, but here are some details.

I'll be sending an RFC for this change as well, because I think that
this is a major parsing *model* change. It does not affect the
resulting output in normal cases, but will do what the user intended
in more unusual cases.

On the AST side, it boils down to changing BlockCommandComment to
contain multiple paragraphs inside, which will allow commands like
\param to have multi-paragraph descriptions. On libclang side, this
change, for example, makes clang_BlockCommandComment_getParagraph()
return incomplete results -- just the first paragraph.

Dmitri

>
>>
>> Do we have no users? :frowning:
>>
>> This is a huge API surface area that is being ripped out. The cries of
>> even a single user that depends on this will forever shatter all
confidence
>> in LLVM's C API stability promise. (also, btw, I wouldn't expect a user
of a
>> stable API to be subscribed to the developer's mailing list of the
>> project...; that is a key difference between a stable API and a "no
>> backwards compatibility; the onus is on you to be aware of what is
happening
>> in the codebase" API).
>
> +1
> I'm the user of that API... (yep, I'm subscribed to the dev list :slight_smile:

Hi Alex,

Thank you for a quick reply. It is unexpected for me that these APIs
have users.

And not every user is necessarily going to show up on this thread. The C
API is our *one* stability promise; please don't break it, or else it will
break all confidence in our promise!

>> If these API's are deprecated (later to be removed) please write a
>> migration guide to help existing users transition to the new recommended
>> approach (also, it would be nice to have some documentation about what
the
>> recommended approach even is!). Can you estimate the expected
development
>> cost required to transition from the old API to the recommended API?
(for
>> example, what if bunding an XML library is not feasible for a user that
>> depends on this functionality?)
>
> agreed. personally I don't want to link w/ any XML lib... it was quite
> enough (moreover, it is desired) to have an access to plain text
comments...
> I don't want any XML, cuz extracted comments get indexed for further
> full-text search... and having XML here is completely harmful and
> undesired...

You can still extract plain text from XML pretty easily, without
linking to an XML library. While *parsing* XML requires a real
parser, stripping tags can be done with with a regexp.

Please don't suggest this. Regexp's are not a solution for working with
XML. Seriously.

-- Sean Silva

>>
>> Hello,
>>
>> I'd like to remove libclang APIs to traverse the comment AST.
>> Instead, I would recommend to use clang_FullComment_getAsXML, which
>> provides the same information, but does not expose implementation
>> details of the comment AST.
>
> I am not currently a user of the comment API, but I object to the general
> direction you appear to be taking. Maybe I've just been scarred by too
many
> bad experiences with trying to parse badly thought out XML documents, but
> I'd *much* rather have a plain C API rather than an XML only one.

I hope that our comment XML API is well-designed. We have a schema,
we have a lot of tests, and we verify that all our output conforms to
the schema. Did you have a chance to take a look at the schema and
example output? Please tell me what you think and if anything looks
wrong.

> Also, given that this is a public API which has been communicated to be
> fairly stable, I'm concerned by the fact that you appear to be removing
> large chunks.

Yes, this fact actually concerns me. This is why I started this
discussion on the mailing list.

The original reason to provide these APIs was to implement a testing
mode for comments in c-index-test. These APIs were not meant for
other uses, but I see that I failed to communicate that in the
documentation at all.

Why did you even put them in the header *and the exports list* if they
aren't public API?

> Can you provide a concrete reason for wanting to remove these
> APIs? What actual change is it blocking?

I think this is a separate issue, but here are some details.

I'll be sending an RFC for this change as well, because I think that
this is a major parsing *model* change. It does not affect the
resulting output in normal cases, but will do what the user intended
in more unusual cases.

On the AST side, it boils down to changing BlockCommandComment to
contain multiple paragraphs inside, which will allow commands like
\param to have multi-paragraph descriptions. On libclang side, this
change, for example, makes clang_BlockCommandComment_getParagraph()
return incomplete results -- just the first paragraph.

It seems like all they can do is get the text of the paragraph, so why not
just fake up a dummy paragraph whose text content is the content of all the
paragraphs? and document this restriction and point the user to a new API
that gives the full access to the paragraphs? I'm also really worried that
this will also change the XML schema in an incompatible way: the XML schema
is by extension part of the stability promise of the C API, and can't be
changed incompatibly without breaking users.

-- Sean Silva

The original reason to provide these APIs was to implement a testing
mode for comments in c-index-test. These APIs were not meant for
other uses, but I see that I failed to communicate that in the
documentation at all.

Why did you even put them in the header *and the exports list* if they
aren't public API?

As I said, not communicating this is my fault. I just added these
APIs the same way as we usually add APIs to libclang.

> Can you provide a concrete reason for wanting to remove these
> APIs? What actual change is it blocking?

I think this is a separate issue, but here are some details.

I'll be sending an RFC for this change as well, because I think that
this is a major parsing *model* change. It does not affect the
resulting output in normal cases, but will do what the user intended
in more unusual cases.

On the AST side, it boils down to changing BlockCommandComment to
contain multiple paragraphs inside, which will allow commands like
\param to have multi-paragraph descriptions. On libclang side, this
change, for example, makes clang_BlockCommandComment_getParagraph()
return incomplete results -- just the first paragraph.

It seems like all they can do is get the text of the paragraph, so why not
just fake up a dummy paragraph whose text content is the content of all the
paragraphs? and document this restriction and point the user to a new API
that gives the full access to the paragraphs? I'm also really worried that
this will also change the XML schema in an incompatible way: the XML schema
is by extension part of the stability promise of the C API, and can't be
changed incompatibly without breaking users.

Or we could only return the first paragraph. Faking a paragraph
requires creating AST nodes, which requires memory allocation on the
ASTContext, which is permanent.

The schema change is:

Index: bindings/xml/comment-xml-schema.rng

>
>> The original reason to provide these APIs was to implement a testing
>> mode for comments in c-index-test. These APIs were not meant for
>> other uses, but I see that I failed to communicate that in the
>> documentation at all.
>
> Why did you even put them in the header *and the exports list* if they
> aren't public API?

As I said, not communicating this is my fault. I just added these
APIs the same way as we usually add APIs to libclang.

>> > Can you provide a concrete reason for wanting to remove these
>> > APIs? What actual change is it blocking?
>>
>> I think this is a separate issue, but here are some details.
>>
>> I'll be sending an RFC for this change as well, because I think that
>> this is a major parsing *model* change. It does not affect the
>> resulting output in normal cases, but will do what the user intended
>> in more unusual cases.
>>
>> On the AST side, it boils down to changing BlockCommandComment to
>> contain multiple paragraphs inside, which will allow commands like
>> \param to have multi-paragraph descriptions. On libclang side, this
>> change, for example, makes clang_BlockCommandComment_getParagraph()
>> return incomplete results -- just the first paragraph.
>
> It seems like all they can do is get the text of the paragraph, so why
not
> just fake up a dummy paragraph whose text content is the content of all
the
> paragraphs? and document this restriction and point the user to a new API
> that gives the full access to the paragraphs? I'm also really worried
that
> this will also change the XML schema in an incompatible way: the XML
schema
> is by extension part of the stability promise of the C API, and can't be
> changed incompatibly without breaking users.

Or we could only return the first paragraph. Faking a paragraph
requires creating AST nodes, which requires memory allocation on the
ASTContext, which is permanent.

Why not create it on demand in "freeable" storage?

The schema change is:

Index: bindings/xml/comment-xml-schema.rng

--- bindings/xml/comment-xml-schema.rng»(revision 194522)
+++ bindings/xml/comment-xml-schema.rng»(working copy)
@@ -393,7 +393,9 @@
           <!-- In general, template parameters with whitespace discussion
                should not be emitted. Schema might be more strict here.
-->
           <element name="Discussion">
- <ref name="TextBlockContent" />
+ <oneOrMore>
+ <ref name="TextBlockContent" />
+ </oneOrMore>
           </element>
         </element>
       </oneOrMore>
@@ -435,7 +437,9 @@
                should not be emitted, unless direction is explicitly
specified.
                Schema might be more strict here. -->
           <element name="Discussion">
- <ref name="TextBlockContent" />
+ <oneOrMore>
+ <ref name="TextBlockContent" />
+ </oneOrMore>
           </element>
         </element>
       </oneOrMore>

I think that this change is compatible (but it depends on how one
defines "compatible"). Previously, <Discussion> tag could contain
only a single paragraph; after this change it can contain multiple.

I'm not an XML expert, but that looks like it should be compatible for most
non-pathological use cases.

-- Sean Silva

There is no API to free a CXComment node, because it is supposed to
represent an real AST node, which is allocated on ASTContext.

Dmitri

>
>> Or we could only return the first paragraph. Faking a paragraph
>> requires creating AST nodes, which requires memory allocation on the
>> ASTContext, which is permanent.
>
> Why not create it on demand in "freeable" storage?

There is no API to free a CXComment node, because it is supposed to
represent an real AST node, which is allocated on ASTContext.

Can't you allocate it on the CXTranslationUnit (or whatever the relevant
nearby "context" object is that is explicitly created/destroyed)?

-- Sean Silva

There is no difference, because the lifetime of these contexts is the same.

Dmitri

>
>>
>> >
>> >> Or we could only return the first paragraph. Faking a paragraph
>> >> requires creating AST nodes, which requires memory allocation on the
>> >> ASTContext, which is permanent.
>> >
>> > Why not create it on demand in "freeable" storage?
>>
>> There is no API to free a CXComment node, because it is supposed to
>> represent an real AST node, which is allocated on ASTContext.
>
>
> Can't you allocate it on the CXTranslationUnit (or whatever the relevant
> nearby "context" object is that is explicitly created/destroyed)?

There is no difference, because the lifetime of these contexts is the same.

Wait then what's the big deal with just allocating them permanently? It's
not like these comments are going to be gigabytes in size.

-- Sean Silva

Not gigabytes, but possibly very large. For example, on OS X, the
text of fully preprocessed Cocoa.h with comments takes over 90 Mb.

Anyway, I see that a stable API promise from libclang is not to be
taken lightly, so it might be the case that we will need to keep these
APIs. But this case clearly shows that exposing such low-level
details is not sustainable, and leads to unfortunate workarounds.

Dmitri

>
>>
>> >
>> >>
>> >> >
>> >> >> Or we could only return the first paragraph. Faking a paragraph
>> >> >> requires creating AST nodes, which requires memory allocation on
the
>> >> >> ASTContext, which is permanent.
>> >> >
>> >> > Why not create it on demand in "freeable" storage?
>> >>
>> >> There is no API to free a CXComment node, because it is supposed to
>> >> represent an real AST node, which is allocated on ASTContext.
>> >
>> >
>> > Can't you allocate it on the CXTranslationUnit (or whatever the
relevant
>> > nearby "context" object is that is explicitly created/destroyed)?
>>
>> There is no difference, because the lifetime of these contexts is the
>> same.
>
> Wait then what's the big deal with just allocating them permanently? It's
> not like these comments are going to be gigabytes in size.

Not gigabytes, but possibly very large. For example, on OS X, the
text of fully preprocessed Cocoa.h with comments takes over 90 Mb.

That's why I was suggesting to create them on-demand; that 90Mb will only
be paid if they happen to call getParagraph on every single comment in
those headers. Also I would suggest to not allocate them in the ASTContext,
and just keep them in the CXTranslationUnit (or whatever the "context"
object is) as strictly a "compatibility" thing.

Anyway, I see that a stable API promise from libclang is not to be
taken lightly, so it might be the case that we will need to keep these
APIs.

Breaking APIs that are claimed to be stable and that users depend on is one
of the worst sins in software engineering; far worse than for example
breaking tests (insofar that those tests don't indicate that the stable API
is broken... I think you get what I'm trying to say).
API stability plays roles in business decisions e.g. no budget or developer
time is allocated to "fixing" breakage (since it's stable); if suddenly the
API breaks, then developer time will need to be allocated to fixing it
instead of working on "actual stuff that we planned and that is needed for
the business". Often it will be a nontrivial amount of work to fix API
breakage; even besides developer time. For example the planned deployment
strategy may not admit bundling an XML library, and then suddenly fixing
the problem might require coordination with non-technical management or
marketing ("we had to switch to this other installer software which
requires a 240x240 logo on a dark background instead of 180x180 on a light
background"), etc.; not all breakage is as bad as I just described, but
there are a lot of cases where breakage can cause *much* worse issues than
I just described)

But this case clearly shows that exposing such low-level
details is not sustainable, and leads to unfortunate workarounds.

Indeed. It is a design decision like any other. Generally speaking any
declaration that is added to a stable public header should be accompanied
by a lot of careful thought about the ramifications.

In this regard, things like (schema-validated) XML are useful since they
generally admit more flexibility in what can be changed without breaking
existing code that processed it.

-- Sean Silva