tabs in the input and their effect on the column position

Hi All,

I'm currently experimenting with improving doxygen's parsing capabilities by using the information from clang.
I'm using the libclang functions clang_tokenize and clang_annotateTokens to create hyperlinked and syntax highlighted
output for the source files processed by doxygen.

So far it works quite well, but when the source file contains a tab character, the tab seems to be counted as a single
column, which misaligns the output.

Is there some way to configure the number of spaces in a tab? Or is there a way to replace tabs with spaces before sending the
contents of a file to libclang, without first having to write the detabbed file to disk?

Any help is appreciated.

Regards,
  Dimitri

I want to jump in before anyone starts suggesting solutions, and point out that all column positions in libclang output are counted in bytes from the start of a line. That means that a tab shows up as one byte, but it also means that if the source contains multibyte characters, you may have a range of three bytes referring to a single character (and a single column).
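
To make the byte/column distinction concrete, here is a minimal sketch of mapping a byte column to a display column. It is not part of libclang: the name display_column and the tab_width parameter are hypothetical, and the sketch assumes UTF-8 input, one display cell per code point (so wide characters would still come out wrong), and tab stops at fixed intervals.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a 1-based byte column on `line` into a 1-based display column.
 * Assumptions (not guaranteed by libclang): `line` is valid UTF-8, every
 * code point occupies one display cell, and tab stops fall every
 * `tab_width` cells. */
static unsigned display_column(const char *line, unsigned byte_col,
                               unsigned tab_width)
{
    assert(line != NULL && byte_col >= 1 && tab_width >= 1);
    unsigned cells = 0; /* display cells consumed before the target byte */
    for (unsigned i = 0;
         i + 1 < byte_col && line[i] != '\0' && line[i] != '\n'; ++i) {
        unsigned char c = (unsigned char)line[i];
        if (c == '\t')
            cells += tab_width - (cells % tab_width); /* jump to next tab stop */
        else if ((c & 0xC0) != 0x80)
            ++cells; /* count lead bytes only; skip UTF-8 continuation bytes */
    }
    return cells + 1;
}
```

With a tab width of 8, a token reported at byte column 2 of a line that starts with a tab would map to display column 9.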

Clang and libclang expect you to interpret their output appropriately for your use case. Perhaps you process tabs to mean “new table cell”.

All that said, making the byte/column map machinery in TextDiagnostic more reusable would probably be a good thing all around. LLVM’s diagnostics don’t handle multibyte characters at all.

Jordan

Hi Jordan,

Thanks for the info.

It would have been nice if clang_tokenize had included whitespace as tokens (if only as
an option). Then one could reconstruct the output purely by looking at the tokens. Without it, one
needs to extract the whitespace between the tokens from the original file and analyse it to
see how it should be rendered in the output. That sounds neither efficient nor convenient.
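
As a rough sketch of the gap extraction described above: the TokenRange struct and gap_before function are hypothetical, not libclang API. The byte offsets are assumed to have been obtained beforehand, for example via clang_getTokenExtent and clang_getSpellingLocation, and the tokens are assumed to be sorted and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a token's byte extent in the source buffer,
 * as the half-open range [begin, end). */
typedef struct { size_t begin, end; } TokenRange;

/* Point *gap at the source text lying between token i-1 (or the start of
 * the buffer when i == 0) and token i, and return its length in bytes.
 * The gap is not NUL-terminated; use the returned length. */
static size_t gap_before(const char *src, const TokenRange *toks, size_t i,
                         const char **gap)
{
    assert(src != NULL && toks != NULL && gap != NULL);
    size_t prev_end = (i == 0) ? 0 : toks[i - 1].end;
    assert(prev_end <= toks[i].begin); /* tokens must be in source order */
    *gap = src + prev_end;
    return toks[i].begin - prev_end;
}
```

The renderer would emit each gap (expanding any tabs in it) before emitting the highlighted token that follows.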

Meanwhile I found out how to use CXUnsavedFile to pass a de-tabbed representation of the
original file, which works for me. Note that in my case both input and output are UTF-8
encoded, so the multi-byte characters are just passed through fine.
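
For reference, the de-tabbing step could look roughly like the sketch below. The function name expand_tabs and its parameters are made up for illustration, and it assumes UTF-8 input with fixed tab stops. The returned buffer and length would then go into the Contents and Length fields of a CXUnsavedFile whose Filename matches the file being parsed; that part is left out so the example stays self-contained.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly malloc'd, NUL-terminated copy of src[0..len) in which each
 * tab is replaced by spaces up to the next tab stop. The new length is
 * stored in *out_len. The caller frees the result. Returns NULL if the
 * allocation fails. Assumes UTF-8 input, one column per code point. */
static char *expand_tabs(const char *src, size_t len, unsigned tab_width,
                         size_t *out_len)
{
    assert(src != NULL && out_len != NULL && tab_width >= 1);
    /* Worst case: every input byte is a tab that expands to tab_width. */
    char *dst = malloc(len * tab_width + 1);
    if (dst == NULL)
        return NULL;
    size_t n = 0;   /* bytes written so far */
    size_t col = 0; /* 0-based display column on the current line */
    for (size_t i = 0; i < len; ++i) {
        unsigned char c = (unsigned char)src[i];
        if (c == '\t') {
            size_t pad = tab_width - (col % tab_width);
            memset(dst + n, ' ', pad);
            n += pad;
            col += pad;
        } else {
            dst[n++] = (char)c;
            if (c == '\n')
                col = 0; /* new line: restart column counting */
            else if ((c & 0xC0) != 0x80)
                ++col;   /* skip UTF-8 continuation bytes */
        }
    }
    dst[n] = '\0';
    *out_len = n;
    return dst;
}
```

Because the tabs are replaced before libclang sees the buffer, the reported byte columns refer to the de-tabbed text, so they line up with the rendered output as long as the file contains no multibyte characters before the token on the same line.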

Regards,
  Dimitri

Adding an option to include whitespace tokens in clang_tokenize is a valid feature request. Since all we're doing is lexing (not parsing), Clang certainly has this capability; it's just not exposed. Please file a feature request at http://llvm.org/bugs/.

Just to warn you, if output is misaligned for tabs, it could potentially be misaligned for wide characters as well. But perhaps that's not important in your current use case; I'm glad you found a solution that works.

Jordan