Adding more HTML-related facilities in Doxygen comment parsing

Hello,

In one of the threads on cfe-commits I was asked by Tobias to provide
a rationale for adding more HTML-related validation facilities in
Clang's comment parsing.

HTML is an integral part of Doxygen syntax. It is impossible to parse
Doxygen without not merely parsing HTML tags, but also doing semantic
analysis on them.

For example, paragraph splitting is more complex than just finding an
empty line.

/// <b>Aaa
///
/// Bbb
void f();

/// <table>
/// <tr><td>Aaa</tr></td>
///
/// </table>
void g();

Somehow (I am not saying that the rules make sense, it is just how it
is), Doxygen interprets these as follows:

1. for f():

<p><b>Aaa</b></p>
<p><b>Bbb </b></p>

An unterminated <b> tag started to span multiple paragraphs.

2. for g():

<table class="doxtable">
<tr>
<td><p class="starttd">Aaa</p>
<p class="endtd"></p>
</td></tr>
</table>

An empty line between table tags made Doxygen add a second paragraph
to a table cell that had its content clearly specified.

Judging just from these two simple examples, it is clear that parsing
embedded HTML in Doxygen so that the output actually takes the HTML
markup into account requires *semantic* analysis of HTML tags, and
transformation of the HTML AST.
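To make this concrete, here is a small Python sketch of a paragraph splitter that must consult HTML tag state. This is not Clang's actual algorithm, and the tag sets are illustrative assumptions: a blank line counts as a paragraph break only when no table-like tag is open, and still-open inline tags are carried into the next paragraph, mimicking the two examples above.

```python
import re

INLINE_TAGS = {"b", "i", "em", "code"}    # reopened across paragraph breaks
BLOCK_TAGS = {"table", "tr", "td", "th"}  # suppress paragraph splitting

def split_paragraphs(lines):
    """Split doc-comment lines into paragraphs, tracking open HTML tags.

    A blank line ends a paragraph only when no block-level tag is open;
    open inline tags are carried over into the next paragraph.
    """
    paragraphs, current, open_tags = [], [], []
    for line in lines:
        text = line.strip()
        if not text:  # blank line: a candidate paragraph break
            if any(t in BLOCK_TAGS for t in open_tags):
                continue  # inside a table: not a paragraph break
            if current:
                paragraphs.append(" ".join(current))
                # carry still-open inline tags into the next paragraph
                current = ["<%s>" % t for t in open_tags if t in INLINE_TAGS]
            continue
        # update the open-tag stack from the tags seen on this line
        for slash, name in re.findall(r"<(/?)([a-zA-Z]+)", text):
            if slash:
                if name in open_tags:
                    open_tags.remove(name)
            else:
                open_tags.append(name)
        current.append(text)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

On the f() example above this yields two paragraphs, with the `<b>` reopened in the second; on the g() example the blank line inside `<table>` produces no split at all.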

It is a non-trivial amount of work to implement this, and I did look
for HTML libraries that could help us do so. libtidy [1] is a nice
one, except that I got the impression that it has "stabilized to the
point of becoming unmaintained" -- there are no releases, the code is
available only through CVS, and it was never updated for HTML5. There
is an experimental HTML5 fork of it [2], which has not been updated
for more than two years and probably does not correspond to the
current HTML5 draft.

But even if libtidy completely supported HTML5, its interface is not
suitable for the fine-grained parsing and AST manipulation that we
need. The interface accepts only complete HTML documents for parsing,
while Clang deals with fragments. Constructing the HTML AST through
libtidy just to find out a tag's name is not going to deliver good
performance either.

Apart from libtidy, I did not find any other *lightweight* libraries
(as opposed to HTML rendering engines) that provide the low-level
manipulation we need.

But parsing and doing semantic analysis correctly is only half of the
story. Sanitizing the output is important; otherwise, Clang clients
cannot use the HTML parts of comments and have to re-do the parsing
work, this time with the intent of sanitizing the output. I think it
is reasonable to state that almost all clients want the output as
well-formed HTML, sanitized of JavaScript. It rarely (if ever) makes
sense to put executable JavaScript into comments anyway.

I hope this addresses everyone's concerns.

Dmitri

[1] http://tidy.sourceforge.net/
[2] https://github.com/w3c/tidy-html5


Thanks Dmitri. This is very informative.


Out of interest. What is required to sanitize HTML? Do we need a full HTML5 parser, including all the quirks? With javascript support? How large do you expect this to become? During the time the support is incomplete, can we provide any guarantees about the absence of javascript?

Thanks again, Dmitri!

Tobias

> Out of interest. What is required to sanitize HTML?

There are two different levels of sanitizing:
- well-formedness of HTML,
- absence of javascript.

The former is harder to guarantee than the latter, but it is important
nevertheless, because being able to pass HTML from Clang's output
directly into a webpage template and get back a document that passes
validation is a useful property.
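As a sketch of what the well-formedness level means in practice, the core check is that tags are balanced and properly nested, with void elements exempt. This is illustrative only (the void-element list here is a made-up subset, and a real checker would handle comments, CDATA, and attribute quoting too):

```python
import re

# Void elements never take a closing tag (illustrative subset of HTML5).
VOID_TAGS = {"br", "hr", "img"}

def is_well_formed(fragment):
    """Check that tags in an HTML fragment are balanced and properly nested."""
    stack = []
    for slash, name in re.findall(r"<(/?)([a-zA-Z]+)[^>]*>", fragment):
        name = name.lower()
        if name in VOID_TAGS:
            continue
        if not slash:
            stack.append(name)          # open tag: push
        elif not stack or stack.pop() != name:
            return False                # close without matching open
    return not stack                    # leftover open tags: ill-formed
```

Note that the `<tr><td>Aaa</tr></td>` line from the earlier example fails this check, which is exactly the kind of input the sanitizer has to repair before emitting.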

> Do we need a full HTML5
> parser, including all the quirks? With javascript support? How large do you
> expect this to become? During the time the support is incomplete, can we
> provide any guarantees about the absence of javascript?

Our filtering is based on a whitelist for HTML tags and a blacklist
for attributes. I did my best to look through the HTML5 spec, find
the attributes that can contain embedded JavaScript, and add those to
the blacklist. I think our filtering is reasonable and should not
allow any JavaScript according to the HTML5 spec. But a blacklist is
a blacklist: if, for example, a certain browser supports a
non-standard attribute with embedded JS, we will not catch that.
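A toy model of such filtering, with made-up whitelist and blacklist contents (Clang's actual lists differ; this just shows the shape of the approach):

```python
ALLOWED_TAGS = {"a", "b", "i", "p", "em", "code", "table", "tr", "td"}

def filter_attrs(attrs):
    """Strip attributes that can carry embedded JavaScript."""
    kept = {}
    for name, value in attrs.items():
        if name.lower().startswith("on"):       # event handlers: onclick, ...
            continue
        if name.lower() in ("href", "src") and \
                value.strip().lower().startswith("javascript:"):
            continue                            # javascript: URLs
        kept[name] = value
    return kept

def filter_tag(name, attrs):
    """Whitelist tags; blacklist script-capable attributes on the rest."""
    if name.lower() not in ALLOWED_TAGS:
        return None                             # unknown tag: drop it
    return (name.lower(), filter_attrs(attrs))
```

The blacklist caveat above applies directly here: a browser-specific scriptable attribute not matched by these rules would slip through.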

When implementing further semantic analysis for Doxygen parsing, I
don't expect many quirks to come from HTML. Most HTML quirks are
about rendering, not parsing. In fact, parsing and extraction of HTML
tags has been implemented for a long time already; we just have no
idea about their semantics (we only know whether a tag may have an end
tag </foo> or not). I expect more complexity in implementing the
undocumented Doxygen rules about the interaction between Doxygen
markup and HTML.

Dmitri

Very interesting. Thanks a lot.

Tobias

>> Out of interest. What is required to sanitize HTML?

> There are two different levels of sanitizing:
> - well-formedness of HTML,
> - absence of javascript.
>
> The former is harder to guarantee than the latter, but it is important
> nevertheless, because being able to directly pass through HTML from
> Clang's output into a webpage template and get back a document that
> passes validation is a useful property.

Dmitri, this may be an interesting problem to solve but it doesn't make sense to build it into libclang.

LLVM has no procedure for handling 0-day vulnerabilities, contacting vendors, pushing updates, or working with the web community, nor should it. What happens if a 0-day cross-site-scripting attack is found and user passwords are stolen?

This is really so far out of scope and mislayered that it's very much a disservice to the few users who might actually use the facility. Why are we building an insecure web technology security validator into clang? That's a separate project.

Ordinarily you pipe tool output through a well-maintained and up-to-date script that knows about browser and JavaScript quirks. Can we please just point users to that workflow and get on with things?

>> Do we need a full HTML5
>> parser, including all the quirks? With javascript support? How large do you
>> expect this to become? During the time the support is incomplete, can we
>> provide any guarantees about the absence of javascript?

> Our filtering is based on a whitelist for HTML tags and a blacklist
> for attributes. I did my best to look though HTML5 spec and find
> attributes that can contain embedded javascript, and added those to
> the blacklist. I think our filtering is reasonable and should not
> allow any javascript according to HTML5 spec. But a black list is a
> black list, for example if a certain browser supports a non-standard
> attribute with embedded JS, we will not catch that.

This is not a problem the compiler should be dealing with on any level, let alone by hand. This is a significant chunk of code that needn't be there.

> When implementing further semantic analysis for Doxygen parsing, I
> don't expect many quirks to come from HTML. Most of HTML quirks are
> about rendering, not parsing. In fact, parsing and extraction of HTML
> tags has been implemented for a long time already, we just have no
> idea about their semantics (we only know if a tag may have an end tag
> </foo> or not). I expect more complexity in implementing undocumented
> Doxygen rules about interaction between Doxygen markup and HTML.

As someone who has worked on an HTML5 parser or two, and on JavaScript too, I fail to see how the HTML/JavaScript filters in ToT serve any purpose at all, because they are, and always will be, trivially exploitable.

The ~20,000 LoC implementing XML schemas, HTML, and JavaScript validators are all so intertwined that it's difficult to cut things down to provide the basic comment callbacks and diagnostics users would benefit from.

Alp.


> Dmitri, this may be an interesting problem to solve but it doesn't make
> sense to build it into libclang.
>
> LLVM has no procedure for 0-day vulnerabilities, contacting vendors and
> pushing updates working with the web community, nor should it. What happens
> if a 0-day cross-site-scripting attack is found and user passwords are
> stolen?
>
> This is really so far out of scope and mislayered, that it's very much a
> disservice to the few users who might actually use the facility. Why are we
> building a web technology security validator into clang that is insecure?
> That's a separate project.
>
> Ordinarily you pipe tool output through a well-maintained and up-to-date
> script that knows about browser and JavaScript quirks. Can we please just
> point users to that workflow and get on with things?

Parsing Doxygen is inherently intertwined with HTML parsing and
semantic analysis. Doing the filtering at the same level does not
seem out of scope or mislayered.


> As someone who has worked on an HTML5 parser or two, and JavaScript too, I
> fail to see how the HTML/JavaScript filters in ToT serve any purpose at all
> because they are, and always will, be trivially exploitable.

I would disagree.

> ~20,000 LoC implementing XML schemas, HTML, JavaScript validators .. are all
> so intertwined it's difficult to cut things down to provide the basic
> comment callbacks and diagnostics users would benefit from.

Alp, the way you have been conducting this discussion is
non-constructive. You are trying to reuse Clang's comment parsing for
some other, as yet unknown, purpose. It seems that it is hard for you
to factor the code (because it is tied to Clang's ASTs, on purpose, to
provide diagnostics), but you have started blaming the code and
finding deficiencies where there are none.

Dmitri

Are there no 3rd-party libraries or tools which already do this that
we could rely on? If not, where do you see this code living within
the overall structure of the compiler? Will it continue to be a part
of clangAST like the other comment-related code?

~Aaron

Hi Aaron,

As I explained in the first message in this thread, libtidy would
technically work, except that (1) it was never updated for HTML5, does
not have formal releases, and is probably also unmaintained; and (2)
constructing an HTML DOM just to check a tag name is a superfluous
exercise in using a library for the sake of using a library, and it
will not deliver good performance either. Apart from libtidy, I am
not aware of other libraries with suitable functionality and licence.

The HTML tables and helpers can be factored out somewhere into
clangBasic, clangHTML, or even llvmSupport or llvmHTML -- this is a
bikeshed that I mostly don't care about. But the Doxygen-specific
semantic analysis of HTML, as illustrated in the first post, has to
live in the same library as comment parsing.

About comment parsing living in libAST: it is possible to move it to
a separate library, but we would have to bounce off an abstract base
class to untie the circular dependency between ASTContext and that
libClangComment. Because comment parsing currently lives in
completely separate files in libAST, which are clearly named as such,
I don't think it is a burden for Clang developers working on libAST.

Dmitri

You've convinced yourself that the existing code has no deficiencies, therefore any suggestions must be motivated by some unknown ulterior purpose.

That's not the case and I can tell you there's no conspiracy :-)

I do however make an incisive observation that's perhaps not easy to hear. Users are looking for two things:

1) Fast -Wdocumentation that basically just ensures that \param and \return match what's in the declarator, and perhaps at most extracts the \brief. Why not use the Regex class and do this in 100 lines of code in ParseDecl.h? If the regex fails, no big deal.

2) Efficient and flexible callbacks to support documentation tooling and IDEs with interactive use. We're totally failing to expose useful callbacks at present. That's because it's all getting hard-coded into a massive monolithic world view of how and when docs should be consumed that nobody is really using. If we sit down and instead decide which ASTConsumer interface to put this into, that's perhaps no more than 50 lines of code.

We're failing miserably at both of these right now, and both are things a compiler should actually provide.
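For what it's worth, the regex-level \param check proposed in (1) could be sketched as follows. This is hypothetical illustration, not an actual patch (and in Python rather than against Clang's Regex class); the function names and the regex are assumptions:

```python
import re

# Match \param or @param, with an optional [in]/[out]/[in,out] direction,
# capturing the documented parameter name.
PARAM_RE = re.compile(r"[\\@]param(?:\[[^\]]*\])?\s+(\w+)")

def check_params(comment, declared_params):
    """Return \param names documented in `comment` that do not name an
    actual parameter of the declaration (a regex-level check)."""
    documented = PARAM_RE.findall(comment)
    return [p for p in documented if p not in declared_params]
```

Whether such a check is adequate is precisely what is disputed below: it finds simple mismatches, but it cannot handle HTML tags, escape sequences, or Doxygen's quirkier constructs.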

There is instead this "comment AST" which doesn't serve any mainstream use case but appears to be part of a Doxygen-like tool you're building. We cannot even enable it on our own build bots on llvm.org. Despite the prescribed terminology these documents aren't really an abstract syntax tree, nor are they even part of clang's AST. There are curiously named Lex, Parse and Sema classes that mimic clang, and even a set of RAV templates, presumably to visit documents that rarely have a depth greater than one level?

This application should be split out into an external plugin or tool while we look for a quick solution to (1) and (2). As an external tool it will help validate clang's programming interfaces.

There is no criticism here, just calling what I see. Could we get back to discussing how to split this out in an orderly manner?

Cheers,
Alp.

I agree with Alp that this looks like a case of poor layering. I only use Markdown within my doxygen comments, and so I can easily imagine a use for a tool that extracts doc comments but does not handle HTML. Similarly, I don't think it would make sense to have libclang embed a Markdown parser.

If a tool wants to be able to understand comments written in HTML, Markdown, ReST, or one of myriad other markup languages, then it will most likely also want to transform these into some internal data structure for representing structured text (NSAttributedString, whatever), and (depending on the use) may wish to do so in a way that permits editing and exporting.

There are already libraries that do this for many GUI toolkits. Recreating a subset of them in libclang doesn't seem very useful, and does sound like something that would increase the code size (and attack surface) a lot if done in a way that's even vaguely feature-complete.

David


> You've convinced yourself that the existing code has no deficiencies,
> therefore any suggestions must be motivated by some unknown ulterior
> purpose.
>
> That's not the case and I can tell you there's no conspiracy :-)
>
> I do however make an incisive observation that's perhaps not easy to hear.
> Users are looking for two things:
>
> 1) Fast -Wdocumentation that basically just ensures that \param and \return
> match what's in the declarator, and perhaps at most extracts the \brief. Why
> not use the Regex class and do this in 100 lines of code in ParseDecl.h? If
> the regex fails, no big deal.

Why not re-implement C++ parsing with a regex, then? This sounds
like the same kind of argument to me. Will your regex run in linear
time like the current parser does? You cannot extract \brief, by the
way, without skipping HTML tags, unescaping Doxygen escape sequences,
and probably handling tons of other quirks that I cannot remember on
the spot.

> 2) Efficient and flexible callbacks to support documentation tooling and
> IDEs with interactive. We're totally failing to expose useful callbacks at
> present. That's because it's all getting hard-coded into a massive monolthic
> world view of how and when docs should be consumed that nobody is really
> using.

I want to reassure you that this is used.

> If we sit down and instead decide which ASTConsumer interface to put
> this into that's perhaps no more than 50 lines of code.

Please explain how a callback-based interface would be better than an
AST interface.

> There is instead this "comment AST" which doesn't serve any mainstream use
> case but appears to be part of a Doxygen-like tool you're building.
>
> We
> cannot even enable it on our own build bots on llvm.org.

-Wdocumentation is enabled on all my clang_fast buildbots, and on the
lldb and lld buildbots.

> Despite the
> prescribed terminology these documents aren't really an abstract syntax
> tree,

They are an abstract syntax tree.

> nor are they even part of clang's AST. There are curiously named Lex,
> Parse and Sema classes that mimic clang, and even a set of RAV templates,
> presumably to visit documents that rarely have a depth greater than one
> level?

I don't see a problem with this. Would you rather see one monolithic
class where all processing, starting from unescaping Doxygen escapes
to matching \param to declarations, is jammed together?

> This application should be split out into an external plugin or tool while
> we look for a quick solution to (1) and (2). As an external tool it will
> help validate clang's programming interfaces.
>
> There is no criticism here, just calling what I see. Could we get back to
> discussing how to split this out in an orderly manner?

Again you are talking about splitting this out, and I have yet to see
how Clang would benefit from it.

Dmitri


> Hi Aaron,
>
> As I explained in the first message in this thread, libtidy would
> technically work, except: (1) it was never updated for HTML5, and does
> not have formal releases, it is probably also unmaintained, and (2)
> constructing an HTML DOM just to check the tag name is a superfluous
> exercise in using a library just for the sake of using a library and
> it will not deliver good performance either. Apart from libtidy, I am
> not aware of other libraries with suitable functionality and licence.

Fair (I missed that entire paragraph originally... sorry!).

> The HTML tables and helpers can be factored out somewhere into
> clangBasic, clangHTML or even llvmSupport or llvmHTML -- this is a
> bikesched that I mostly don't care about.

I don't believe this to be a bikeshed at all; it's actually my primary
concern at this point.

> But Doxygen-specific
> semantic analysis of HTML, as illustrated in the first post, has to
> live in the same library as comment parsing.
>
> About comment parsing living in libAST -- it is possible to move it to
> a separate library, but we would have to bounce off an abstract base
> class to untie the circular dependency between ASTContext and that
> libClangComment. Because currently comment parsing is living in
> completely separate files in libAST, which are clearly named as such,
> I don't think that comment parsing is a burden for Clang developers
> working on libAST.

I am mostly concerned about factoring this out such that it can be
disabled via build flags when building Clang. I've not seen much
information about just how far down the rabbit hole this sanitization
will go (for instance, if there's a DTD, will it be followed?), so I
am worried about security implications from this. I also agree with
Alp that this feels like a scope issue -- why is a C-family compiler
getting HTML validation + sanitization as part of its core components
(increasing maintenance burden, review burdens, etc)?

FWIW, if there was a way to turn Doxygen support of this nature into a
plugin, my complaints would vanish.

~Aaron

David,

HTML is a part of Doxygen. If we are not doing it, then we are
implementing our own documentation language that no other person in
the world cares about. This is as if someone said, "I don't use
partial specialization of templates in C++, so Clang should not be
implementing it."

Dmitri

I think you are missing the point. The Clang libraries parse C++ into an AST, which is a clang-specific data structure. That's fine, because there aren't many other libraries that expose C++ AST data structures that users of clang want to interoperate with. Clang then generates LLVM IR and object code from C++, using well-defined (or, in some cases, poorly defined, but at least vaguely standardised) ABIs.

This is in direct contrast to a consumer of documentation, which may want to integrate with one of many different libraries that already provide complex data structures and APIs for handling rich text.

Currently, libclang exposes the 'comment AST', which is an unwieldy thing that doesn't seem to address any needs. As a consumer of the documentation, I either want:

- To parse the doc comments myself.

- To have them transformed into something that I can easily consume with an existing parser (e.g. HTML).

You also seem to be under the impression that doxygen is the only markup language that is found in [Objective-]C[++] source files. For Objective-C, Apple's HeaderDoc and GSDoc are more popular, but there are half a dozen other less-popular ones.

I fully support interfaces in libclang that allow plugins for different comment markup languages, but deciding to hard-code one (and one that is poorly defined and apparently allows all of HTML 5) seems like a terrible idea.

David


> I think you are missing the point. The Clang libraries parse C++ into an AST, which is a clang-specific data structure. That's fine, because there aren't many other libraries that expose C++ AST data structures that users of clang want to interoperate with. Clang then generates LLVM IR and object code from C++, using well-defined (or, in some cases, poorly defined, but at least vaguely standardised) ABIs.
>
> This is in direct contrast to a consumer of documentation, which may want to integrate with one of many different libraries that already provide complex data structures and APIs for handling rich text.
>
> Currently, libclang exposes the 'comment AST', which is an unwieldy thing that doesn't seem to address any needs.

That is not all it exposes. It also exposes a cooked comment in XML
format with a well-defined schema, which preserves the markup and
semantic pieces of the AST. You can XSLT that XML into HTML.

> You also seem to be under the impression that doxygen is the only markup language that is found in [Objective-]C[++] source files. For Objective-C, Apple's HeaderDoc and GSDoc are more popular, but there are half a dozen other less-popular one.

HeaderDoc is sufficiently similar to Doxygen, and in fact, Clang's
parser is forgiving enough to consume HeaderDoc as well.

> I fully support interfaces in libclang that allow plugins for different comment markup languages, but deciding to hard-code one (and one that is poorly defined and apparently allows all of HTML 5) seems like a terrible idea.

Doxygen is, more or less, an industry standard (one of, at least). As
soon as there is someone who is willing to implement a second comment
markup language, I am willing to help with factoring.

Dmitri


> Not only. It also exposes a cooked comment in XML format with a
> well-defined schema, that preserves the markup and semantic pieces of
> the AST. You can XSLT that XML into HTML.

So it doesn't address any needs? Good we've cleared that up.


> Doxygen is, more or less, an industry standard (one of, at least). As
> soon as there is someone who is willing to implement a second comment
> markup language, I am willing to help with factoring.

Nobody is asking for a second comment markup language in clang. This is a thread about removing the first one, a change for which there's already broad consensus.

This is a very significant cleanup of clang's internals, saving ~15,000-20,000 checked-in LoC. The refactoring will have minimal impact on users, and there will be a callback to allow Doxygen-like tools to be implemented externally, plus a really simple doc comment checker in the parser. The biggest challenge is the libclang C interface, but I'm confident we'll be able to work with that given the desire to sort out clang's comment system.

What this means in practice is that clang::comments::RawComment will remain, though it'll no longer need to allocate and store comment string duplicates ahead of time. The logic to attach comments to declarations will remain as-is, though it will be possible to override it. The rest will be split out.

I hope you change your mind about not helping with the factoring because it's a lot of work to take on. Perhaps we can preserve some of the monolithic comment AST support as an external plugin / library.

Alp.


> So it doesn't address any needs? Good we've cleared that up.

Alp, could you please stop using demagogic tricks and pulling
conclusions out of nowhere? Indeed, I would like to remove the
comment AST APIs in libclang, not because they don't serve any
purpose, but because they introduce an unnecessary ABI burden, while
the XML APIs in libclang do not. But we cannot remove the comment AST
APIs from libclang because of the ABI stability guarantee.

> Nobody is asking for a second comment markup language in clang. This is a
> thread about removing the first one, a change for which there's already
> broad consensus.

I don't believe there is any other developer except you suggesting
anything like that.

Dmitri

If you actually read through the responses you'll find otherwise.

What I don't understand is: why the complete lack of interest in the original proposal for incremental cleanups and layering fixes to ASTContext, RawComment and FullComment, if this is code you're interested in maintaining or have strong opinions about?

Throughout that discussion you gave the impression that the code is basically unmaintained. Now you're saying there's a libclang stability argument and asking to keep the feature around longer.

I don't think it's reasonable to expect others to do this grunt work trying to fix the core of the comment system while you commit arbitrary chunks of dubious HTML5 validation.

So perhaps you can provide a more constructive response to the original suggestions for improvement as a way forward.

Alp.


> If you actually read through the responses you'll find otherwise.

This is false. Please provide concrete quotes without
misinterpreting other people's opinions. Nobody was suggesting that.
In fact, Aaron asked on IRC how hard it would be to add support for
MSVC's XML-based comment parsing.

> What I don't understand is, why the complete lack of interest in the
> original proposal for incremental cleanups and layering fixes to ASTContext,
> RawComment and FullComment if this is code you're interested in maintaining
> or have strong opinions about?

I do not see a concrete proposal, just general remarks about
'factoring' without a concrete goal.

> Throughout that discussion you gave the impression that the code is
> basically unmaintained.

This is false. There have been commits going into this code all the
time. Please take the time to look through the SVN history.

> I don't think it's reasonable to expect others to do this grunt work trying
> to fix the core of the comment system

Please, Alp, you have yet to demonstrate, in a constructive way, that
something is actually broken.

Dmitri

What applications does this HTML5 validation enable? I’ve tried to skim this thread to find the big picture, but I can’t find it.

Why does Clang need to validate the HTML, rather than simply associating comments with Decls and handing them over to a client who knows the details of Doxygen and HTML?