This is done without ever loading any file in an editor. But we do run a
lot of clang_parseTranslationUnit2 calls which will internally open files
from disk. Then we visit the AST and get e.g. the position for a class
declaration. Assuming the file is UTF-8 encoded, I want to translate
that position to a UTF-16 position.
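To make the translation concrete, here is a minimal sketch in C++ of turning a UTF-8 byte column into a UTF-16 code-unit column. It assumes that the reported column counts bytes (which is exactly the open question in this thread), that columns are 1-based with 0 meaning "invalid", and that the raw bytes of the line are already at hand. The name utf8ColumnToUtf16Column is hypothetical and is not part of clang-c.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not a libclang function).
// `line` holds the raw UTF-8 bytes of one source line, without the line terminator.
// `byteColumn` is a 1-based column counted in bytes; 0 is treated as "invalid".
// Returns the matching 1-based column counted in UTF-16 code units.
// Assumption: the input is well-formed UTF-8; stray continuation bytes are
// counted as one unit each rather than rejected.
std::size_t utf8ColumnToUtf16Column(std::string_view line, std::size_t byteColumn) {
    if (byteColumn == 0) {
        return 0;
    }
    std::size_t end = byteColumn - 1;
    if (end > line.size()) {
        end = line.size();  // clamp columns that point past the end of the line
    }
    std::size_t utf16Units = 0;
    for (std::size_t i = 0; i < end;) {
        const unsigned char lead = static_cast<unsigned char>(line[i]);
        std::size_t length = 1;  // bytes in this UTF-8 sequence
        std::size_t units = 1;   // UTF-16 code units for the same code point
        if (lead >= 0xF0) {
            length = 4;
            units = 2;  // code points above U+FFFF need a surrogate pair
        } else if (lead >= 0xE0) {
            length = 3;
        } else if (lead >= 0xC0) {
            length = 2;
        }
        i += length;
        utf16Units += units;
    }
    return utf16Units + 1;
}
```

For a line such as "int é = 1;", the '=' sits at byte column 8 but at UTF-16 column 7, because 'é' takes two bytes in UTF-8 and one code unit in UTF-16.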
Can't you convert to UTF-16 during load? Then you don't need to
translate at all.
I'm under the impression that you are keeping a UTF-8 data blob in an
environment that mostly talks UTF-16; in that case, the cleanest
solution would be to have the data blob in UTF-16, too. Of course I
don't know how much of your code base you'd have to touch to change
that, this could be quite nasty or surprisingly easy.
What data blob are you referring to? I have the feeling we are talking past
each other in this discussion.
Then I'm not seeing, or I missed, where you have UTF-8 data.
On one hand I have:
  for every file in given directory
    traverse resulting AST
      for every interesting cursor
        store range of this cursor
The data blob we cache is a range[start(line, column), end(line, column)].
I understand that the column index values here may be wrong.
Is that correct?
> The large code base expect this to be UTF-16 column offsets.
Okay, then we're in the same boat with this.
> Assuming the file is
> encoded in UTF-8 on-disk then this is what I'll get from clang-c.
Does clang-c consider each byte of a multibyte UTF-8 encoding as a character taking up its own column?
In that case, all you need to do is to file a bug.
I just stumbled upon clang::SourceManager.createExpansionLoc. I don't know how this will interact with macro expansion though.
> For that
> reason I'd like to convert it at this point. An API in clang-c for efficient
> access to the underlying UTF-8 buffer of a given CXFile would help a lot for
> that purpose (and in other scenarios we currently (ab)use clang_tokenize to
> stringify a range).
Hmm... I guess that would be exposing data structures that are currently internal. I have no idea how much the clang folks will like that; maybe there were plans to open that anyway, maybe they don't want to because there are other plans.
Oh. There is already clang::SourceManager.getMemoryBufferForFile.
So what I'm asking, again, is whether an API such as the following would be
CXString clang_getRangeSpelling(CXSourceRange range);
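For illustration only: if the UTF-8 buffer of a file were accessible, stringifying a range would reduce to a substring over byte offsets, such as the offsets that clang_getSpellingLocation reports through its last out-parameter. The sketch below is in C++ and uses only the standard library. rangeSpelling is a hypothetical name, the buffer is assumed to be already loaded, and this is not how the proposed clang_getRangeSpelling would necessarily be implemented.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not a libclang function).
// `buffer` is the UTF-8 content of one file.
// `beginOffset` and `endOffset` are byte offsets into that buffer, forming the
// half-open range [beginOffset, endOffset).
// Returns the bytes in that range, or an empty string if the range is invalid.
std::string rangeSpelling(std::string_view buffer,
                          std::size_t beginOffset,
                          std::size_t endOffset) {
    if (beginOffset > endOffset || endOffset > buffer.size()) {
        return {};
    }
    return std::string(buffer.substr(beginOffset, endOffset - beginOffset));
}
```

With the buffer "class Foo {};" and offsets 6 and 9, this yields "Foo". Working on byte offsets avoids the column question entirely, since no character counting is involved.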
Somebody with more knowledge will have to answer that.
(I generally know a lot more about Unicode than about clang; my clang knowledge is very, very basic.)