UTF-8 vs. UTF-16 code locations

Hey all,

what would be the best way to get UTF-16 code locations from the clang-c API?

As far as I can see it's not currently possible, and I wonder if it would be
possible with the C++ API which I could then wrap in a new C function.

The reason I'm asking is that we in KDevelop work with QString offsets in the
editor, which is internally UTF-16 encoded. Now imagine we parse a UTF-8
encoded text file with the following contents:

void foo() {
  int c = 0;
  /* ümlaut */ c++;
}

Any API in clang-c that takes or returns a column will be off-by-one from what
we expect from an editor/UTF-16 column pov, due to the 'ü' which takes up two
UTF-8 code units but just one UTF-16 code unit. This breaks our highlighting
and code browsing features, but thankfully such input is rare. I'd still like
to fix it though if possible and if it doesn't cost too much runtime
performance.

What is the suggested way of handling this situation? Is there maybe prior art
somewhere to efficiently translate between UTF-8/UTF-16 code locations that I
could study?

Thanks

If your input is UTF-8 and you are internally handling it as UTF-16, you
will need to keep a mapping table. As both UTF-8 and UTF-16 are
variable-width encodings (i.e. a given Unicode character can map to a
varying number of UTF-8 or UTF-16 code units), it is not possible to
create a static mapping function. You don't necessarily have to map
every character, though. Since both encodings are essentially state-free,
it is enough to pick a known starting point earlier in the text and decode
forward from there.
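
Roughly like this (a minimal sketch; the function name is made up, it assumes
valid UTF-8 with utf8Offset landing on a sequence boundary, and it has no
error handling):

#include <cstddef>
#include <cstdint>
#include <string_view>

// Decode forward from a known starting point (here: the start of the text)
// and count how many UTF-16 code units the first utf8Offset bytes cover.
std::size_t utf8ToUtf16Offset(std::string_view utf8, std::size_t utf8Offset)
{
    std::size_t utf16Units = 0;
    std::size_t i = 0;
    while (i < utf8Offset && i < utf8.size()) {
        const auto byte = static_cast<std::uint8_t>(utf8[i]);
        if (byte < 0x80) {
            i += 1;          // ASCII: 1 byte, 1 UTF-16 unit
            utf16Units += 1;
        } else if ((byte & 0xE0) == 0xC0) {
            i += 2;          // 2-byte sequence, BMP: 1 UTF-16 unit
            utf16Units += 1;
        } else if ((byte & 0xF0) == 0xE0) {
            i += 3;          // 3-byte sequence, BMP: 1 UTF-16 unit
            utf16Units += 1;
        } else {
            i += 4;          // 4-byte sequence: needs a surrogate pair
            utf16Units += 2;
        }
    }
    return utf16Units;
}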

Joerg

What Jörg said.

You may want to look at ICU, see http://site.icu-project.org/

Keep in mind that Unicode is more than just dealing with UTF-8 encoding. There is also:
* Multiple characters that users expect to count as one. Czechs consider ch a single character, so they will expect the cursor to skip over ch as if it were one. Maybe the Czechs don't mind if you don't get it right for them, I don't know - we're in the murky waters of cultural expectations here.
* Ligatures, i.e. single glyphs that are just two connected characters. You still want to allow a character break inside such a glyph.
* Right-to-left scripts. Particularly nasty if RTL and LTR are mixed: you will end up having to highlight discontinuous screen areas.
* Scripts where letters are written around the next letter. I forgot which script has this, it might be Devanagari.

Line wrapping brings more fun of that kind.

You'll need to decide how many of these things are relevant for your user base.
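
For the first point, ICU's BreakIterator is the thing to look at; a minimal
sketch (default locale, error handling mostly omitted):

#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>

int main()
{
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> it(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(),
                                                    status));
    if (U_FAILURE(status))
        return 1;

    // "u" + combining diaeresis: two code points, one user-perceived character
    icu::UnicodeString text = icu::UnicodeString::fromUTF8("u\xcc\x88mlaut");
    it->setText(text);

    for (int32_t start = it->first(), end = it->next();
         end != icu::BreakIterator::DONE;
         start = end, end = it->next()) {
        // [start, end) is one grapheme cluster, in UTF-16 code units
        std::cout << "cluster: [" << start << ", " << end << ")\n";
    }
}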

Thanks guys,

I was aware of this. The question is more whether there is prior art in that
aspect. I expect that most other IDEs/editors that embed clang use UTF16
because it is used by Qt, Windows and Java internally. I doubt we are the
first to run into this issue.

If it turns out that we are actually the only ones with this issue so far then
I'll leave this issue unresolved for now. Too bad, but the effort required to
fix it from scratch seems to be quite high. I wonder whether the other IDEs
thought the same :wink:

Bye

I’m not sure what you’re looking for - if you interface with a tool you need to convert the data you have into the format the tool expects and back.
You’ll need to convert your internal representation to UTF-8 and back anyway, so you’ll also need to convert the offsets from clang if it doesn’t give you character offsets already.

> I was aware of this. The question is more whether there is prior art in that
> aspect.

Well, ICU, to be sure that the algorithms are really correct.
Not sure about prior art for keeping the offsets in sync. This sounds like a pretty standard editor task to me, which mostly follows from what data structures are already there and whether you

> I expect that most other IDEs/editors that embed clang use UTF16
> because it is used by Qt, Windows and Java internally. I doubt we are the
> first to run into this issue.

I don't know about Qt.
Windows GDI uses code pages, and UTF-16 for file names. No real offset stuff in that. The editor components are essentially opaque: you throw in the whole text, cr/lf and all, and let the component do its thing. Windows editors use their own routines I suppose, so no real Windows support anyway.
I don't know what Windows with .NET does.
For Java, you usually slurp in the full file and don't even think about what the original encoding was, until you write stuff back.

> If it turns out that we are actually the only ones with this issue so far then
> I'll leave this issue unresolved for now. Too bad, but the effort required to
> fix it from scratch seems to be quite high. I wonder whether the other IDEs
> thought the same :wink:

Eclipse can handle UTF-8 files quite fine. I think it's going the standard route for Java code - but then source code files are usually small enough that you can easily keep them in the heap.

Editing multi-gigabyte logs is an entirely different issue, and it's surprisingly hard to find an editor that can handle this scenario without going brick mode.

So... question is: What's your use case actually? Is it feasible to read and convert the file in one go, and never bother keeping a relationship to file positions until you write it back?

Regards,
Jo

If you ignore the existence of UTF-16 surrogate pairs, then the mapping is
quite trivial and can be done very quickly.

E.g. certain ranges of UTF-16 code units map to a fixed number of UTF-8 code
units:

0x0000 - 0x007F → 1 code unit
0x0080 - 0x07FF → 2 code units
0x0800 - 0xFFFF → 3 code units

This allows you to quickly walk a line of UTF-16 code units and get a
corresponding UTF-8 code unit location.

The converse is to check the high-order bits of the leading UTF-8 code unit
to see how many bytes to skip over to walk across a single UTF-16 code unit.
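
In code, walking a line of UTF-16 code units with that table looks roughly
like this (a sketch; the function name is made up, and it ignores surrogate
pairs and invalid input, as said):

#include <cstddef>
#include <string_view>

// Accumulate the UTF-8 width of each UTF-16 code unit up to utf16Offset,
// using the ranges above.
std::size_t utf16ToUtf8Offset(std::u16string_view line, std::size_t utf16Offset)
{
    std::size_t utf8Units = 0;
    for (std::size_t i = 0; i < utf16Offset && i < line.size(); ++i) {
        const char16_t unit = line[i];
        if (unit <= 0x007F)
            utf8Units += 1;   // 0x0000 - 0x007F
        else if (unit <= 0x07FF)
            utf8Units += 2;   // 0x0080 - 0x07FF
        else
            utf8Units += 3;   // 0x0800 - 0xFFFF
    }
    return utf8Units;
}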

Thanks for the input!

The missing step then for me is an efficient way to access the contents of a
line. With clang-c, the only way I see is a costly clang_tokenize call. Is
there one on the C++ side of clang? I see SourceManager::getCharacterData -
would that be the right API to use? If so, I'll whip up a patch to make this
accessible via clang-c, such that we can build a somewhat efficient mapping
procedure on top of that.
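
What I have in mind on the C++ side would be something like this sketch
(lineAt is a made-up helper, error handling omitted; whether this is the
intended use of the API is exactly my question):

#include "clang/Basic/SourceManager.h"
#include "llvm/ADT/StringRef.h"
#include <cstddef>

// Slice the line containing Loc out of the UTF-8 buffer clang saw.
llvm::StringRef lineAt(const clang::SourceManager &SM,
                       clang::SourceLocation Loc)
{
    bool Invalid = false;
    const clang::FileID FID = SM.getFileID(Loc);
    const llvm::StringRef Buffer = SM.getBufferData(FID, &Invalid);
    if (Invalid)
        return {};
    const unsigned Offset = SM.getFileOffset(Loc);
    const std::size_t Begin = Buffer.rfind('\n', Offset) + 1; // npos + 1 == 0
    std::size_t End = Buffer.find('\n', Offset);
    if (End == llvm::StringRef::npos)
        End = Buffer.size();
    return Buffer.slice(Begin, End);
}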

Thanks

Don’t you already have the file as UTF-8 so you can hand it into clang? Is there a reason not to get the line out of that format?

I only have access to those files I pass in via CXUnsavedFile; all others are
opened directly by clang. Considering that Clang already has access to the
string contents of any file in the TU, that seems like the best approach for
me to access it, no?

From my quick glance over
https://code.woboq.org/llvm/clang/include/clang/Basic/SourceManager.h.html#clang::SourceManager
I see the following potential candidates:

  SourceManager::getBufferData
  SourceManager::getCharacterData
  SourceManager::getBuffer + MemoryBuffer API

Wouldn't those fill the gap? Or do you think I (and any other IDE) should
duplicate the code to find the contents of a given CXFile inside the TU, based
on either the CXUnsavedFile or an mmapped file from disk?

Thanks

I’d have expected that you just read the files from disk yourself. I’d expect that to give fewer different code paths to do the same thing, so I’d hope it reduces complexity. But in reality I have no idea what I’m talking about as I don’t know your codebase :slight_smile: I don’t think that those design decisions can or should be made for all IDEs, so I’m not sure what other IDEs do is really relevant.

We do read the file ourselves when the user opens a file in the editor, but
that is only a small fraction of those files that get parsed via
clang_parseTranslationUnit2. The majority of files will be read directly from
disk by clang itself. The results we obtain from traversing the AST are then
cached; most notably this stores the ranges that need to be highlighted if a
file gets opened eventually.

So that said, would you object to making any of the SourceManager::* API
public via a new clang-c function? Assuming of course they do what I expect
them to do, i.e. give me access to the file buffer (at a given position) that
clang saw while parsing the TU? It would certainly make this task more
efficient to implement for us.

Thanks

I’m probably still missing something: don’t you only need to load the file if there’s a result mentioning it and you want the user to open it?

Yes, but in KDevelop we parse all files of a project and cache the results.
Once you open a file it will show the results from that cache, and we can also
use the locations in our cache for code browsing through the whole project, to
jump e.g. to where classes are defined, etc.

This is done without ever loading any file in an editor. But we do run a lot
of clang_parseTranslationUnit2 calls which will internally open files from
disk. Then we visit the AST and get e.g. the position for a class declaration.
In order to convert that position, assuming the file is UTF-8 encoded, I want
to translate it to a UTF-16 position. For that I'd need efficient access to
either the full file buffer, or, even nicer, the line buffer for this
position.

Such an API that gives us direct access to the file/line buffer would also
allow us to remove some other places where we currently have to use
clang_tokenize for manual stringification of a range.

Thanks

Can't you convert to UTF-16 during load? Then you don't need to translate at all.
I'm under the impression that you are keeping a UTF-8 data blob in an environment that mostly talks UTF-16; in that case, the cleanest solution would be to have the data blob in UTF-16, too. Of course I don't know how much of your code base you'd have to touch to change that; this could be quite nasty or surprisingly easy.

What data blob are you referring to? I have the feeling we are talking past
each other in this discussion :wink:

On one hand I have:

for every file in given directory
  call clang_parseTranslationUnit
  traverse resulting AST
    for every interesting cursor
      store range of this cursor
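
In libclang terms that loop is roughly the following sketch ("foo.cpp" and the
empty compiler arguments are placeholders; error handling and the actual
caching are omitted):

#include <clang-c/Index.h>

static CXChildVisitResult visit(CXCursor cursor, CXCursor /*parent*/,
                                CXClientData /*data*/)
{
    CXSourceRange range = clang_getCursorExtent(cursor);
    unsigned line = 0, column = 0;
    clang_getSpellingLocation(clang_getRangeStart(range), nullptr,
                              &line, &column, nullptr);
    // column counts bytes here (hence the mismatch described above);
    // this is the value I want to translate into a UTF-16 column.
    return CXChildVisit_Recurse;
}

int main()
{
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        index, "foo.cpp", nullptr, 0, nullptr, 0, CXTranslationUnit_None);
    clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, nullptr);
    clang_disposeTranslationUnit(tu);
    clang_disposeIndex(index);
}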

The data blob we cache is a range[start(line, column), end(line, column)]. The
large code base expects this to be UTF-16 column offsets. Assuming the file is
encoded in UTF-8 on-disk then this is what I'll get from clang-c. For that
reason I'd like to convert it at this point. An API in clang-c for efficient
access to the underlying UTF-8 buffer of a given CXFile would help a lot for
that purpose (and in other scenarios we currently (ab)use clang_tokenize to
stringify a range).

So what I'm asking, again, is whether an API such as the following would be
acceptable:

CXString clang_getRangeSpelling(CXSourceRange range);
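
For comparison, the clang_tokenize workaround we use for stringification today
looks roughly like this (a sketch; stringifyRange is a made-up helper, and note
that it loses the exact whitespace between tokens):

#include <clang-c/Index.h>
#include <string>

std::string stringifyRange(CXTranslationUnit tu, CXSourceRange range)
{
    CXToken *tokens = nullptr;
    unsigned numTokens = 0;
    clang_tokenize(tu, range, &tokens, &numTokens);

    std::string result;
    for (unsigned i = 0; i < numTokens; ++i) {
        // Concatenate the spelling of every token in the range.
        CXString spelling = clang_getTokenSpelling(tu, tokens[i]);
        result += clang_getCString(spelling);
        result += ' ';
        clang_disposeString(spelling);
    }
    clang_disposeTokens(tu, tokens, numTokens);
    return result;
}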

Thanks

> What data blob are you referring to? I have the feeling we are talking past
> each other in this discussion :wink:

Then I'm not seeing (or have missed) where you have UTF-8 data.

> On one hand I have:
>
> for every file in given directory
>   call clang_parseTranslationUnit
>   traverse resulting AST
>     for every interesting cursor
>       store range of this cursor
>
> The data blob we cache is a range[start(line, column), end(line, column)].

I understand that the column index values here may be wrong.
Is that correct?

> The large code base expects this to be UTF-16 column offsets.

Okay, then we're in the same boat with this.

> Assuming the file is encoded in UTF-8 on-disk then this is what I'll get
> from clang-c.

Does clang-c consider each byte of a multibyte UTF-8 encoding as a character taking up its own column?
In that case, all you need to do is to file a bug :slight_smile:

I just stumbled upon clang::SourceManager::createExpansionLoc. I don't know how this will interact with macro expansion though.

> For that reason I'd like to convert it at this point. An API in clang-c for
> efficient access to the underlying UTF-8 buffer of a given CXFile would help
> a lot for that purpose (and in other scenarios we currently (ab)use
> clang_tokenize to stringify a range).

Hmm... I guess that would be exposing data structures that are currently internal. I have no idea how much the clang folks will like that; maybe there were plans to open that anyway, maybe they don't want to because there are other plans.
Oh. There is already clang::SourceManager::getMemoryBufferForFile.

> So what I'm asking, again, is whether an API such as the following would be
> acceptable:
>
> CXString clang_getRangeSpelling(CXSourceRange range);

Somebody with more knowledge will have to answer that.
(I generally know a lot more about Unicode than about clang; my clang knowledge is very, very basic.)