Unicode path handling on Windows

I’m trying to fix Unicode file handling on Windows (http://llvm.org/bugs/show_bug.cgi?id=10348). This currently doesn’t work because argv is encoded as a multibyte string (the clang project is configured this way).

Michael suggested converting the command line to UTF-8, and this indeed solves the error that the driver emits, but there is another check in CompilerInstance that fails because FileSystemStatCache::get calls ::open, and I’m guessing that this function is not smart enough to handle a UTF-8 path on Windows. Any ideas?

I have one more question. I added a MultibyteToUTF8 function to PathV2.inc (Windows version) and now I’d like to call it from ExpandArgv (driver.cpp), but this code is platform specific and isn’t visible (the function is inside an anonymous namespace). I could create a wrapper function that calls this function on Windows and does nothing on other platforms. Is this the way to go, and where should I put it (llvm::sys::fs, llvm::sys::path, or somewhere else)?
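
For illustration only, the wrapper idea could be sketched roughly as follows. The name `to_utf8_from_system_codepage` is hypothetical and is not an existing LLVM function, and the question of which namespace it belongs in is left open. On Windows the sketch goes through UTF-16 using the Win32 calls `MultiByteToWideChar` and `WideCharToMultiByte`; on other platforms it returns its input unchanged.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#include <vector>
#endif

// Hypothetical helper (not part of LLVM): convert a string encoded in the
// system ANSI code page to UTF-8 on Windows; pass it through elsewhere.
std::string to_utf8_from_system_codepage(const std::string &in) {
#ifdef _WIN32
  if (in.empty())
    return std::string();
  // Step 1: system code page -> UTF-16. The first call only asks for the size.
  int wlen = MultiByteToWideChar(CP_ACP, 0, in.data(),
                                 static_cast<int>(in.size()), NULL, 0);
  if (wlen <= 0)
    return in; // conversion failed; fall back to the original bytes
  std::vector<wchar_t> wide(wlen);
  MultiByteToWideChar(CP_ACP, 0, in.data(), static_cast<int>(in.size()),
                      wide.data(), wlen);
  // Step 2: UTF-16 -> UTF-8, again sizing the output first.
  int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, NULL, 0,
                                NULL, NULL);
  if (len <= 0)
    return in;
  std::string out(len, '\0');
  WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, &out[0], len, NULL, NULL);
  return out;
#else
  return in; // non-Windows: the bytes are left untouched
#endif
}
```

Callers such as ExpandArgv could then call the helper unconditionally, which keeps the `#ifdef` in one place.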

Michael suggested converting command line to utf8, and this indeed solves the error that the driver emits, but there is another check in CompilerInstance that fails because FileSystemStatCache::get calls ::open and I’m guessing that this function is not smart enough to handle utf8 path on windows? Any ideas?

It’s not smart enough, no, but you can use _wfopen instead. Note that all of its arguments are wchar_t*.

Ruben

_wopen expects wchar_t* and the only visible function for conversion to utf16 is ConvertUTF8toUTF32 which converts to unsigned shorts. There is a function that does exactly what I need called UTF8ToUTF16, but it’s inside an anonymous namespace inside windows version of PathV2.inc

I could solve this in a number of ways, but I’m not sure which one is preferred inside Clang codebase?

Typo, wanted to say ConvertUTF8toUTF16

_wopen expects wchar_t* and the only visible function for conversion to
utf16 is ConvertUTF8toUTF32 which converts to unsigned shorts.

If you're in #ifdef WIN32 code, just use ConvertUTF8toUTF16 and
reinterpret_cast from unsigned short* to wchar_t*.

-Eli
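
A minimal sketch of the suggestion above, under stated assumptions: `utf8_to_utf16` is a hypothetical stand-in for ConvertUTF8toUTF16 (it is not the real function and does not share its signature), and `as_wide` is likewise a made-up name. The cast is only compiled on Windows, where wchar_t is 16 bits wide.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for ConvertUTF8toUTF16: decodes UTF-8 into UTF-16
// code units. Returns false on truncated sequences or bad continuation
// bytes; overlong encodings are not rejected in this sketch.
bool utf8_to_utf16(const std::string &in, std::vector<unsigned short> &out) {
  out.clear();
  for (std::size_t i = 0; i < in.size();) {
    unsigned char b = static_cast<unsigned char>(in[i]);
    unsigned long cp;
    std::size_t extra; // number of continuation bytes that follow
    if (b < 0x80)                { cp = b;        extra = 0; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
    else return false;
    if (i + extra >= in.size())
      return false; // sequence runs past the end of the input
    for (std::size_t k = 1; k <= extra; ++k) {
      unsigned char c = static_cast<unsigned char>(in[i + k]);
      if ((c & 0xC0) != 0x80)
        return false;
      cp = (cp << 6) | (c & 0x3F);
    }
    i += extra + 1;
    if (cp > 0x10FFFF)
      return false;
    if (cp >= 0x10000) { // encode as a surrogate pair
      cp -= 0x10000;
      out.push_back(static_cast<unsigned short>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<unsigned short>(0xDC00 + (cp & 0x3FF)));
    } else {
      out.push_back(static_cast<unsigned short>(cp));
    }
  }
  return true;
}

#ifdef _WIN32
// View the UTF-16 buffer as wchar_t*. The caller must append a 0 terminator
// before passing the pointer to an API that expects a C string.
inline const wchar_t *as_wide(const std::vector<unsigned short> &units) {
  static_assert(sizeof(wchar_t) == sizeof(unsigned short),
                "this cast assumes a 16-bit wchar_t");
  return reinterpret_cast<const wchar_t *>(units.data());
}
#endif
```

The `static_assert` is there because the cast only makes sense when both types have the same size; on platforms with a 32-bit wchar_t the buffer would have to be converted rather than reinterpreted.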

I think the problem is that PathV2.inc is part of LLVM, and the
ConvertUTF8ToUTF16 function is in an anonymous namespace. So the
question becomes: raise the function into an accessible namespace,
duplicate code, or find some other mechanism?

I don't think it makes sense to raise the function out of the
anonymous namespace unless it's also moved (it has nothing to do with
paths per se). Perhaps it's worth it to move it to StringRef?

~Aaron

Exactly, the problem is that I also need a function that converts from Multibyte to UTF8. I added it to PathV2.inc with other conversion functions and then created a wrapper around it inside llvm::sys::path in order to call it inside ExpandArgv in driver.cpp. I know that this is not the right place for it, and it seems that I also need the one that converts from utf8 to utf16. The question is whether I should raise them (they are windows only functions) and if yes where do they belong?

I think the problem is that PathV2.inc is part of LLVM, and the
ConvertUTF8ToUTF16 function is in an anonymous namespace. So the
question becomes: raise the function into an accessible namespace,
duplicate code, or find some other mechanism?

This function is also available in clang/lib/Basic/ConvertUTF.c

_______________________________________________
cfe-dev mailing list
cfe-dev@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev

-- Jean-Daniel

The function available in clang/lib/Basic/ConvertUTF.c deals with unsigned shorts, and I need wchar_t?

Guys, welcome to the weird world of i18n!
We Japanese have suffered from multibyte charsets for 20 years.

I have added a comment in http://llvm.org/bugs/show_bug.cgi?id=10348 .
Of course, I know it would not be a practical resolution.

FYI, it seems clang can retrieve mbcs path with s/CP_UTF8/CP_ACP/g.

bin\clang.exe -S なかむら\たくみ.c

なかむら\たくみ.c:4:2: error: #error
#error
^
1 error generated.

Though, you should know, MBCS still has an issue;

bin\clang.exe -S 表はダメ文字\表はダメ文字.c

clang: error: no such file or directory: '表はダメ文字\表はダメ文字.c'
clang: error: no input files

Note, "表" is represented as "0x95 0x5C" in CP932.
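
To make the 0x5C problem concrete, here is a small sketch. The function names are hypothetical, and the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are the usual Shift_JIS / CP932 ones. It compares a naive byte search for the separator with a search that skips the trail byte of each double-byte character.

```cpp
#include <cstddef>
#include <string>

// True if the byte starts a double-byte character in Shift_JIS / CP932.
bool is_cp932_lead_byte(unsigned char b) {
  return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive search: treats every 0x5C byte as a path separator.
std::size_t find_separator_naive(const std::string &path) {
  return path.find('\\');
}

// CP932-aware search: skips the trail byte of each double-byte character,
// so a 0x5C that is the second half of a character is not a separator.
std::size_t find_separator_cp932(const std::string &path) {
  for (std::size_t i = 0; i < path.size(); ++i) {
    unsigned char b = static_cast<unsigned char>(path[i]);
    if (is_cp932_lead_byte(b)) {
      ++i; // step over the trail byte as well
      continue;
    }
    if (b == '\\')
      return i;
  }
  return std::string::npos;
}
```

For the byte sequence 0x95 0x5C 0x5C (that is, "表" followed by a real backslash), the naive search stops inside the character at offset 1, while the code-page-aware search reports the real separator at offset 2.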

In principle, IMHHHO;

  - argv should be treated as a "blackbox" byte stream.
  - Don't assume "wmain(argc, wchar_t **argv)". mingw does not have
one. So argv must be presented in the default codepage.
  - A few codepages (e.g. cp932, Japanese Shift JIS) might contain
0x5C (\) in the 2nd (trailing) octet.

Win32 ANSI (****A) APIs assume local codepage.

What we should do in llvm:

  - Treat the pathstring in argv as a blackbox. Never parse
(char*)pathstring without knowing its encoding.
  - UTF8 would be useless on win32. Win32 does not handle utf8
implicitly anywhere.
  - The Path API should hold the pathstring in API-native form (bytestream on
unix, UCS2 wchar_t on win32).
  - Paths should be manipulated in API-native form as far as possible.

In the future, we might consider "-finput-charset" and "-fexec-charset" in clang.
Please consider a source file:

////////
#include "むすめは/まおちゃん.h"
char const literal[] = "俺です、俺俺";
////////

The include path (#include) should be handled as host-dependent. The
literal should be interpreted with the input-charset and emitted with
the exec-charset.

Too hard the life is. Would you like to live in Japan? :stuck_out_tongue:

...Takumi

  • argv should be treated as “blackbox” byte stream.
  • Don’t assume “wmain(argc, wchar_t **argv)”. mingw does not have
    one. Then, argv must be presented as the default codepage.

Correction: I believe MinGW-w64 has a Unicode startup and thus support for wmain (but of course it would be better to shift this to strict API functions)

  • A few codepages (e.g. cp932, Japanese Shift JIS) might contain
    0x5C (\) in the 2nd (trailing) octet.

Win32 ANSI (****A) APIs assume local codepage.

We should do in llvm;

  • Treat pathstring in argv as blackbox. Never parse
    (char*)pathstring without any knowledge.
  • UTF8 would be useless on win32. Win32 does not manipulate utf8
    implicitly in anywhere.
  • Path API should hold pathstring as API-native form (bytestream on
    unix, UCS2 wchar_t on win32).
  • Path should be manipulated as API-native form as possible.

Isn’t it more straightforward to use UTF-8 internally, use the conversion functions provided by the Win32 API when calling other Win32 API functions, and always call the wide versions of the Win32 functions? Full compatibility guaranteed, and one encoding internally.

Ruben

AFAIK Clang internals do assume utf8, and llvm::sys::path converts strings to utf16 on windows and calls W API functions.

I’d appreciate it if somebody could take a look at my changes and comment on them. Here’s a brief explanation of what I did:

  • Convert argv to utf8 using current system locale for win32 (this is done as soon as possible inside ExpandArgv). This makes the driver happy since calls to llvm::sys::path::exists succeed.
  • Change calls to ::open (inside FileSystemStatCache and MemoryBuffer) to ::_wopen on win32 by converting the path to utf16.
  • In order to do the conversions I had to expose two functions, one of them was already there but wasn’t visible, the other one was added by me
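
The second step in the list above could look roughly like this. It is a sketch, not the code from the attached patches: `open_utf8_path` is a hypothetical name, and the Windows branch assumes the path is valid UTF-8. It converts with the Win32 `MultiByteToWideChar` call and opens the file with `_wopen`; other platforms call `::open` directly.

```cpp
#include <fcntl.h>
#include <string>
#ifdef _WIN32
#include <io.h>
#include <windows.h>
#include <vector>
#endif

// Hypothetical helper: open a file whose path is UTF-8 encoded.
// Returns a file descriptor, or -1 on failure.
int open_utf8_path(const std::string &utf8_path, int flags) {
#ifdef _WIN32
  // Passing -1 as the length makes the result include the terminating 0.
  int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8_path.c_str(), -1, NULL, 0);
  if (wlen <= 0)
    return -1;
  std::vector<wchar_t> wide(wlen);
  if (MultiByteToWideChar(CP_UTF8, 0, utf8_path.c_str(), -1, wide.data(),
                          wlen) <= 0)
    return -1;
  return ::_wopen(wide.data(), flags);
#else
  // POSIX treats the path as an opaque byte string, so no conversion is done.
  return ::open(utf8_path.c_str(), flags);
#endif
}
```

A caller would use it as `int fd = open_utf8_path(path, O_RDONLY);` and check for -1 as with plain `::open`.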

Known issues:

  • I should probably use LLVM_ON_WIN32 instead of WIN32 but this macro isn’t defined inside FileSystemStatCache and MemoryBuffer for some reason. Both of these files have an #ifdef section that deals with O_BINARY so maybe these two sections should be consolidated?
  • Functions convert_multibyte_to_utf8 and convert_utf8_to_utf16 have definitions only on windows so every other platform is currently broken.

unicode_path_clang.patch (1.77 KB)

unicode_path_llvm.patch (2.9 KB)

One issue is that filenames on Windows can include Unicode characters not supported by the current code page, so the filenames in const char *argv aren’t necessarily usable. The solution is to avoid argv and instead use the Windows API:

#include <windows.h>
#include <shellapi.h> // for CommandLineToArgvW
#include <codecvt>    // for std::codecvt_utf8_utf16
#include <iostream>
#include <locale>     // for std::wstring_convert
#include <string>
#include <vector>

int main() {
#ifdef WIN32
  // Get the UTF-16 encoded wchar_t arguments.
  int argc = 0;
  LPWSTR *szArglist = CommandLineToArgvW(GetCommandLineW(), &argc);
  if (NULL == szArglist) {
    std::cerr << "CommandLineToArgvW failed\n";
    return 1;
  }

  // Convert to UTF-8 encoded char arguments (C++11).
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
  std::vector<std::string> args;
  for (int i = 0; i < argc; ++i) {
    args.push_back(convert.to_bytes(szArglist[i]));
  }

  // The array returned by CommandLineToArgvW must be released with LocalFree.
  LocalFree(szArglist);
#endif // ifdef WIN32
}

Windows has an API for that: check for an almost ready-made solution here: http://stackoverflow.com/questions/2181205/utf-8-to-from-utf-16-problem (read the answer as well for the proper function call, I don’t have access to my own implementation right now). No need for fancy c++11 here.

In principle, IMHHHO;

- argv should be treated as "blackbox" byte stream.
- Don't assume "wmain(argc, wchar_t **argv)". mingw does not have
one. Then, argv must be presented as the default codepage.

Correction: I believe MinGW-w64 has a Unicode startup and thus support for
wmain (but of course it would be better to shift this to strict API
functions)

Good to hear. Frankly speaking, though, I have little knowledge
of the wmain() scheme...

Isn't it more straightforward to use utf-8 internally and use the conversion
functions provided by the win32 API when calling other win32 API functions,
and always call the wide versions of the win32 functions. Full compatibility
guaranteed, and one encoding internally.

I could propose that if a direct ansi->utf8 conversion were supported by win32.
Having rethought it, holding utf8 internally might be an option.

...Takumi

Nikola,

Your patchset does not work;

bin\clang.exe -S なかむら\たくみ.c

error: error reading '邵コ・ェ邵コ荵昴・郢ァ蝎らクコ貅假ソ・邵コ・ソ.c'
1 error generated.

  - Perhaps the conversion is still missing somewhere?
    I suspect clang might still be pathv1-dependent.
    (I guess pathv1 assumes ansi.)
  - raw_ostream does not handle utf8 on win32, only ansi.

I would like to propose:

  - The utf8 and utf16 conversion functions could move to llvm/lib/Support.
  - We could get rid of CP_UTF8 with the Win32 API. It should be trivial.

P.S. Excuse me, I might respond in more detail later. (Oops, lunch time is over...)

...Takumi

2011/9/2 NAKAMURA Takumi <geek4civic@gmail.com>

Nikola,

Your patchset does not work;

bin\clang.exe -S なかむら\たくみ.c

How can your filename have a backslash?

It’s not a filename, it’s a path, nevermind :slight_smile:

The patch should work for unicode filename, I just realized that it doesn’t work for unicode directories. FileSystemStatCache calls ::stat for directories, and this doesn’t work for utf8 input the same way ::open doesn’t work. I tried to replace it with ::_wstat but this function has a different signature. I think we should take a different approach:

  1. convert all command line input to utf8
  2. rework FileSystemStatCache and MemoryBuffer to use llvm::sys::fs and never explicitly call ::open or ::stat

llvm::sys::fs already has a status function, but I’m not sure whether it can be used as a ::stat replacement.
Can this module be used to open files? I couldn’t find that anywhere.

I think I got it this time. I realized that ::open and ::stat work just fine with multibyte paths on windows so there’s no need to change this code. The only problem is llvm::sys::fs module which falsely assumes that input strings are UTF8 encoded when they are in fact multibyte strings.

Now I really hope I haven’t broken anything because llvm::sys::fs::exists is called in a number of places, but I’m guessing that none of the paths that are passed to it are really UTF8?

I think entire llvm::sys::fs module should be changed to use MultibyteToUTF16 instead of UTF8ToUTF16 before calling windows api functions (unless somebody knows that we actually have UTF8 paths on windows somewhere in the code)?

PathV2.inc.patch (1.43 KB)