RFC: Fuzzy parser for highlighting C++

Hi all,

I am working on a Google Summer of Code project that uses the clang lexer for
syntax highlighting. The intended use is highlighting C++ for LaTeX
(papers, presentations), HTML (documentation, wikis) and other formats.
My goal is to provide a better alternative to Pygments (which highlights C++
on the llvm.org docs) or GNU Source-highlight. These tools identify
keywords perfectly well, but cannot highlight types and functions.

To highlight such source snippets correctly, I wrote a fuzzy parser library on
top of the clang lexer. The clang parser cannot be used for this because
snippets are not necessarily self-contained: they may use types or functions
whose definitions aren't included.
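
To illustrate the idea, here is a minimal sketch of how identifiers can be
classified from their neighbouring tokens alone, without any symbol lookup.
This is not the code from clang-highlight or its fuzzy parser: the names
Kind, Token, lex and classify are hypothetical, and the hand-written lexer
only stands in for the clang lexer so that the example is self-contained.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token categories for illustration only.
enum class Kind { Keyword, Type, Function, Variable, Punct };

struct Token {
  std::string text;
  Kind kind;
};

static bool isIdentStart(char c) {
  return std::isalpha(static_cast<unsigned char>(c)) || c == '_';
}

static bool isIdentChar(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// A deliberately tiny keyword list; a real tool would take this from the lexer.
static bool isKeyword(const std::string &s) {
  static const char *const keywords[] = {"return", "if",   "else", "for", "while",
                                         "const",  "int",  "void", "auto"};
  for (const char *k : keywords)
    if (s == k)
      return true;
  return false;
}

// Split the input into identifiers/keywords and one-character tokens.
// Whitespace is skipped; digits and operators become one-character Punct tokens.
std::vector<Token> lex(const std::string &src) {
  std::vector<Token> toks;
  for (std::size_t i = 0; i < src.size();) {
    if (std::isspace(static_cast<unsigned char>(src[i]))) {
      ++i;
    } else if (isIdentStart(src[i])) {
      std::size_t j = i;
      while (j < src.size() && isIdentChar(src[j]))
        ++j;
      std::string word = src.substr(i, j - i);
      toks.push_back({word, isKeyword(word) ? Kind::Keyword : Kind::Variable});
      i = j;
    } else {
      toks.push_back({std::string(1, src[i]), Kind::Punct});
      ++i;
    }
  }
  return toks;
}

// Guess the role of each identifier from the token that follows it:
//   identifier '('         -> function (a call or a declaration)
//   identifier identifier  -> the first one is a type, as in "Foo bar"
// Everything else keeps the default guess (Variable).
std::vector<Token> classify(std::vector<Token> toks) {
  for (std::size_t i = 0; i + 1 < toks.size(); ++i) {
    if (toks[i].kind != Kind::Variable)
      continue;
    const Token &next = toks[i + 1];
    if (next.text == "(")
      toks[i].kind = Kind::Function;
    else if (next.kind == Kind::Variable)
      toks[i].kind = Kind::Type;
  }
  return toks;
}
```

For the snippet "Matrix m = identity(4);", classify(lex(...)) marks Matrix as a
type and identity as a function even though neither is declared anywhere. The
heuristic is intentionally crude (for example, "Matrix m(4);" would mark m as a
function), which is the kind of imprecision a fuzzy parser has to accept.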

The fuzzy parser doesn't understand all language constructs of C++, but enough
to produce reasonably good highlighting. A sample output produced with
LaTeX can be found on GitHub [1] (136 KB). There's also more documentation
about clang-highlight [2] and the fuzzy parser [3].

I submitted my work for review on Phabricator [4] to get it into
clang/tools/extra.

The fuzzy parser is a general library that may have other uses
besides clang-highlight. clang-format internally has a similar fuzzy parser
that is currently more complete, but it is not written in a reusable way.
Another possible use would be an autocompletion system for editors.

Any opinions or suggestions about this project?

Best,
Johannes

[1] https://github.com/kapf/clang-highlight/blob/master/latex/fuzzyparser.pdf?raw=true
[2] clang-highlight/clang-highlight.rst at master, kapf/clang-highlight on GitHub
[3] clang-highlight/LibFuzzy.rst at master, kapf/clang-highlight on GitHub
[4] D4725: Add new tool clang-highlight to clang/tools/extra

This sounds similar to what clang-format does (lib/Format/UnwrappedLineParser.cpp) – have you looked at that?

He says in the fifth paragraph:

The fuzzy parser is a general library that may have some other potential uses
besides clang-highlight. clang-format internally has a similar fuzzy parser
and is currently more complete, but not written in a reusable way.

Have you tried talking to the clang-format authors about making this code
more reusable? I think a reusable "fuzzy parser" would be quite generally
useful.

Realistically speaking I doubt (purely from a maintenance perspective) that
we will ever have 2 fuzzy parsers in-tree so evolving the clang-format
parser seems like the natural path forward for this sort of work.

-- Sean Silva

Yes, he has, and I mentored his project. I generally agree that we don't
want to have 2 fuzzy parsers, but at this stage, clang-format's parser is
too intricately tangled with clang-format itself. A fresh start seems like
the most promising approach to me, taking some of the lessons learned from
clang-format's parser and putting them into a reusable library. If
successful, we'll be able to switch clang-format over to that parser and
simplify clang-format's implementation.

Also, while clang-format's parser is more complete in some ways, it has
also been highly tuned to extract only the information from the source code
that is relevant to formatting. E.g. while it might be essential for
highlighting to (somewhat) correctly determine type information, that
information doesn't matter for source code formatting in several places.
Thus, I am not sure whether clang-format's current parser can really be
reused/extended for other applications.

Yes, he has, and I mentored his project.

Ah. I'm surprised that there hasn't been more (any?) on-list traffic about
this; usually we at least have an RFC (and doesn't GSoC require one?). Did
a GSoC proposal ever make it to the list?

I generally agree that we don't want to have 2 fuzzy parsers, but at this
stage, clang-format's parser is too intricately tangled with clang-format
itself. A fresh start seems like the most promising approach to me, taking
some of the lessons learned from clang-format's parser and putting them into a
reusable library. If successful, we'll be able to switch clang-format over
to that parser and simplify clang-format's implementation.

Neat. That would be great.

-- Sean Silva