Encoding files in UTF-8 on the fly?

Hello,

As far as I understand, Clang assumes that the source code it reads is encoded in UTF-8. Please let me know if I'm wrong.

I'd like to use Clang to analyse a code base that can be in just about any encoding imaginable. I was thinking about two options to do this:

- Convert all files to UTF-8 before analysis. This might prove difficult, because I can't know beforehand which files Clang will open. Moreover, I'd prefer not to modify the input source code if possible.

- Convert the files on the fly while Clang loads them. This looks cleaner to me, even if it might be more performance-intensive. From a quick look at the code, it looks like a good place to do this would be in VirtualFileSystem.cpp, with the classes File and RealFile (see the sketch below).
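
To make the second option concrete, here is a rough, untested sketch of what I have in mind, assuming a recent LLVM that provides llvm::vfs::ProxyFileSystem. The class names and the convertToUTF8() helper (which could be built on iconv) are mine; nothing like this exists in Clang today:

#include "llvm/ADT/StringRef.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/VirtualFileSystem.h"
#include <memory>
#include <string>
#include <system_error>

// Hypothetical helper: re-encode a byte buffer from some legacy charset
// into UTF-8 (e.g. using iconv). Not part of LLVM.
std::string convertToUTF8(llvm::StringRef Bytes);

namespace {
// Wraps a vfs::File and transcodes its contents when Clang asks for them.
class TranscodingFile : public llvm::vfs::File {
  std::unique_ptr<llvm::vfs::File> Underlying;

public:
  explicit TranscodingFile(std::unique_ptr<llvm::vfs::File> F)
      : Underlying(std::move(F)) {}

  llvm::ErrorOr<llvm::vfs::Status> status() override {
    return Underlying->status();
  }

  llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>>
  getBuffer(const llvm::Twine &Name, int64_t FileSize,
            bool RequiresNullTerminator, bool IsVolatile) override {
    auto Buffer = Underlying->getBuffer(Name, FileSize,
                                        RequiresNullTerminator, IsVolatile);
    if (!Buffer)
      return Buffer.getError();
    // Hand Clang a UTF-8 copy instead of the raw on-disk bytes.
    return llvm::MemoryBuffer::getMemBufferCopy(
        convertToUTF8((*Buffer)->getBuffer()), Name);
  }

  std::error_code close() override { return Underlying->close(); }
};

// Forwards everything to the underlying file system, but wraps every
// file opened for reading in a TranscodingFile.
class TranscodingFileSystem : public llvm::vfs::ProxyFileSystem {
public:
  using ProxyFileSystem::ProxyFileSystem;

  llvm::ErrorOr<std::unique_ptr<llvm::vfs::File>>
  openFileForRead(const llvm::Twine &Path) override {
    auto F = ProxyFileSystem::openFileForRead(Path);
    if (!F)
      return F.getError();
    return std::make_unique<TranscodingFile>(std::move(*F));
  }
};
} // namespace

The idea would be to install such a file system on the CompilerInstance (or on whatever tooling entry point I end up using), so that every file Clang reads goes through the conversion without the files on disk being modified.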

Can you tell me if I'm on the right track?

Can you also tell me why Clang doesn't support -finput-charset? Is this just a question of performance, or is there another issue I'm missing?

Thank you for your help,

> - Convert the files on the fly while Clang loads them. This looks
> cleaner to me, even if it might be more performance intensive. From a
> quick look at the code, it looks like a good place to do this would be
> in VirtualFileSystem.cpp, with the classes File and RealFile.

Doing so can affect behavior if the source code uses characters outside
the basic source character set. Consider:

$ cat t.c
#include <string.h>
const char *s = "À"; // where À is ISO-8859-1 0xC0.
int main() {
   return strlen(s);
}

$ clang t.c -o t; ./t; echo $?
t.c:2:18: warning: illegal character encoding in string literal [-Winvalid-source-encoding]
const char *s = "<C0>";
                  ^~~~
1 warning generated.
1

$ iconv -f iso8859-1 -t utf-8 t.c > t2.c

$ clang t2.c -o t2; ./t2; echo $?
2

Whether that is a problem or not depends on the source code.

> Can you tell me if I'm on the right track?
>
> Can you also tell me why Clang doesn't support -finput-charset? Is this
> just a question of performance, or is there another issue I'm missing?

My guess is that no one has yet been motivated enough to do the work.

Tom.

>> - Convert the files on the fly while Clang loads them. This looks
>> cleaner to me, even if it might be more performance intensive. From a
>> quick look at the code, it looks like a good place to do this would be
>> in VirtualFileSystem.cpp, with the classes File and RealFile.

> Doing so can affect behavior if the source code uses characters outside
> the basic source character set. Consider:
>
> $ cat t.c
> #include <string.h>
> const char *s = "À"; // where À is ISO-8859-1 0xC0.
> int main() {
>    return strlen(s);
> }
>
> $ clang t.c -o t; ./t; echo $?
> t.c:2:18: warning: illegal character encoding in string literal [-Winvalid-source-encoding]
> const char *s = "<C0>";
>                   ^~~~
> 1 warning generated.
> 1
>
> $ iconv -f iso8859-1 -t utf-8 t.c > t2.c
>
> $ clang t2.c -o t2; ./t2; echo $?
> 2
>
> Whether that is a problem or not depends on the source code.

>> Can you tell me if I'm on the right track?
>>
>> Can you also tell me why Clang doesn't support -finput-charset? Is this
>> just a question of performance, or is there another issue I'm missing?

> My guess is that no one has yet been motivated enough to do the work.

There is interest in pursuing this (and some ideas have been discussed);
however, as your code points out, support for -finput-charset and
-fexec-charset is not a trivial endeavour.
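
To illustrate the distinction (my own example, not anything in Clang today): -finput-charset would say how to interpret the bytes of the source file, while -fexec-charset would say which bytes string and character literals should contain in the generated code. Compiled from a UTF-8 source with a UTF-8 execution charset, the snippet below prints "C3 80"; with GCC and -fexec-charset=ISO-8859-1 the same literal becomes the single byte C0:

#include <cstdio>
#include <cstring>

int main() {
  const char *s = "À";  // two bytes, 0xC3 0x80, in a UTF-8 source file
  // Print the bytes the literal actually contains in the binary, i.e.
  // the execution-charset encoding of the character.
  for (std::size_t i = 0; i < std::strlen(s); ++i)
    std::printf("%02X ", (unsigned char)s[i]);
  std::printf("\n");
  return 0;
}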

-- HT

>> Can you also tell me why Clang doesn't support -finput-charset? Is this
>> just a question of performance, or is there another issue I'm missing?

> My guess is that no one has yet been motivated enough to do the work.

The biggest problem is system / standard library / third-party headers
when the input charset is not compatible with ASCII. The simplest
example is Shift-JIS, where the yen symbol clashes with the backslash.
It is therefore very hard to derive sane rules for when the on-the-fly
conversion should be performed and when it should not...
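
As a concrete (hypothetical) illustration, assume the file below is saved in Shift-JIS. The katakana "ソ" at the end of the "memo" comment is encoded as the byte pair 0x83 0x5C, and that trailing 0x5C is '\' in ASCII:

// The last character of the "memo" comment below is the katakana "ソ"
// (Shift-JIS bytes 0x83 0x5C); its second byte is an ASCII backslash.
int main(void) {
  int use_fallback = 1;
  // memo: ソ
  use_fallback = 0;  // spliced into the comment above when read as raw bytes
  return use_fallback;
}

A compiler that treats the bytes as an ASCII-compatible encoding sees a backslash-newline at the end of the "memo" comment, splices the next line into it, and the program returns 1; convert the file to UTF-8 first (e.g. iconv -f SJIS -t UTF-8) and it returns 0. Whether a given 0x5C was meant as a yen sign, a trailing byte, or a real backslash cannot be decided from the bytes alone, which is exactly why a sane conversion rule is hard to pick.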