Request for Windows Unicode support

Hi!

Of course nobody wants to implement Unicode support for Windows,
because Windows should support a UTF-8 locale and Windows is obsolete
anyway :wink:

But there is a simple solution: use boost::filesystem::path everywhere you
use file names and paths, for example in clang::FileManager::getFile.
With version 3, opening a file is easy: std::fstream file(path.c_str()).
Internally, boost::filesystem::path uses the native encoding, which is
UTF-16 on Windows, but you won't notice it since it recodes 8-bit strings
automatically (which is a no-op on Unix and macOS).

If you don't want to depend on Boost, I suggest reimplementing
the most important features, always using 8-bit strings, and then having something
like this:

#ifdef HAVE_BOOST
namespace fs = boost::filesystem;
#else
// simple implementation here
#endif
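
A minimal sketch of what the fallback branch could look like, assuming only construction, c_str(), string() and filename() are needed. The stand-in class is hypothetical and far smaller than the real boost::filesystem::path; as a simplification it treats both '/' and '\' as separators on every platform.

```cpp
#include <string>
#include <utility>

#ifdef HAVE_BOOST
#include <boost/filesystem.hpp>
namespace fs = boost::filesystem;
#else
// Hypothetical stand-in, not Boost's API: it always stores 8-bit strings
// and offers only a handful of members.
namespace fs {
class path {
public:
  path() = default;
  path(std::string s) : str_(std::move(s)) {}
  path(const char *s) : str_(s) {}

  const char *c_str() const { return str_.c_str(); }
  const std::string &string() const { return str_; }

  // Last component after the final '/' or '\\' (simplified rule, see above).
  path filename() const {
    std::string::size_type pos = str_.find_last_of("/\\");
    return pos == std::string::npos ? *this : path(str_.substr(pos + 1));
  }

private:
  std::string str_;
};
} // namespace fs
#endif
```

Code written against fs::path then compiles either way, as long as it sticks to the members both versions provide.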

-Jochen

This happens to be very close to the code I’m working on now (I assume this post was prompted by my patches). I’ll be adding Unicode support to the Windows implementation; however, paths will remain UTF-8 encoded outside of System.

I really wish UCS-2 never existed ;/.

- Michael Spencer

No, this post was prompted by my switching to boost::filesystem version 3 in my own code; llvm/clang 2.8
was the only library with no Unicode support on Windows.
Will your code be API-compatible with boost::filesystem? I ask because boost::filesystem may
become part of the standard, and it is possible to imbue() a locale on boost::filesystem.
While this feature is not needed on Unix/macOS, it gives you global control over whether you want to use ANSI or
Unicode on Windows.
If you implement your own code that always uses UTF-8, this may break compatibility with the Windows ANSI
encoding if you don't take care. And why reinvent the wheel? Maybe you could even copy/paste the
Boost implementation and use the #ifdef HAVE_BOOST approach.

-Jochen

No, this post was prompted by my switching to boost::filesystem version 3
in my own code; llvm/clang 2.8 was the only library with no Unicode
support on Windows.
Will your code be API-compatible with boost::filesystem?

No. boost::filesystem makes extensive use of exceptions, which LLVM is
compiled without, and it does a lot of string allocation and copying.

I ask because boost::filesystem may become part of the standard, and it
is possible to imbue() a locale on boost::filesystem.
While this feature is not needed on Unix/macOS, it gives you global
control over whether you want to use ANSI or Unicode on Windows.
If you implement your own code that always uses UTF-8, this may break
compatibility with the Windows ANSI encoding if you don't take care. And
why reinvent the wheel? Maybe you could even copy/paste the Boost
implementation and use the #ifdef HAVE_BOOST approach.

-Jochen

The conversion only has to be written once. And while I do like the
way boost::filesystem handles locale issues, the API is not suited for
LLVM for the above reasons. However, if you have a better design than
what I proposed, I'd love to see it. I'm not that familiar with
dealing with Unicode under Windows.

- Michael Spencer

Can't you just store filenames as UTF8 (like you do on Linux), and
convert UTF8 to widechar just when calling the windows APIs?
Same for converting back directory listings as such, you get widechar,
and convert back to UTF8.
All you would need to do is implement that conversion in System/Win32,
I think MultiByteToWideChar supports UTF8, doesn't it?

Best regards,
--Edwin
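
On Windows, MultiByteToWideChar called with the CP_UTF8 code page performs this conversion. The portable sketch below shows roughly what that conversion involves, so it can be tried without the Windows headers. The function name utf8ToUtf16 is made up for illustration and is not part of LLVM, Boost, or the Windows API.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes UTF-8 into UTF-16 code units. Returns false on malformed input
// (bad lead or continuation byte, truncated sequence, overlong form,
// encoded surrogate, or a value above U+10FFFF). On failure, `out` holds
// whatever was converted before the error.
bool utf8ToUtf16(const std::string &in, std::u16string &out) {
  out.clear();
  std::size_t i = 0;
  while (i < in.size()) {
    unsigned char b = static_cast<unsigned char>(in[i]);
    std::uint32_t cp;
    std::size_t extra; // number of continuation bytes that must follow
    if (b < 0x80) {
      cp = b;
      extra = 0;
    } else if ((b & 0xE0) == 0xC0) {
      cp = b & 0x1F;
      extra = 1;
    } else if ((b & 0xF0) == 0xE0) {
      cp = b & 0x0F;
      extra = 2;
    } else if ((b & 0xF8) == 0xF0) {
      cp = b & 0x07;
      extra = 3;
    } else {
      return false; // stray continuation byte or invalid lead byte
    }
    if (in.size() - i <= extra)
      return false; // sequence runs past the end of the input
    for (std::size_t k = 1; k <= extra; ++k) {
      unsigned char c = static_cast<unsigned char>(in[i + k]);
      if ((c & 0xC0) != 0x80)
        return false;
      cp = (cp << 6) | (c & 0x3F);
    }
    // Smallest code point that legitimately needs this sequence length.
    static const std::uint32_t minForLen[] = {0, 0x80, 0x800, 0x10000};
    if (cp < minForLen[extra])
      return false; // overlong encoding
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return false;
    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      // Code points above the BMP become a surrogate pair.
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    i += extra + 1;
  }
  return true;
}
```

The reverse direction, for directory listings, would mirror this: read UTF-16 code units, recombine surrogate pairs, and emit one to four bytes per code point.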

No. boost::filesystem makes extensive use of exceptions, which LLVM is
compiled without, and it does a lot of string allocation and copying.
   

Then why not reimplement an exception-free path class that is API-compatible with boost::filesystem::path
(i.e. omitting the rest of boost::filesystem) and, of course, more optimized where possible?
On the other hand, if it can be replaced by Boost via #ifdef, so that all of LLVM has to be compiled with
exceptions, will the binary size increase much?

The conversion only has to be written once. And while I do like the
way boost::filesystem handles locale issues, the API is not suited for
LLVM for the above reasons. However, if you have a better design than
what I proposed, I'd love to see it. I'm not that familiar with
dealing with Unicode under Windows.
   

I have not looked at your solution yet. At the least, you should accept const wchar_t* on Windows.
Then it is possible to pass a Boost path via path.c_str().

-Jochen

Can't you just store filenames as UTF8 (like you do on Linux), and
convert UTF8 to widechar just when calling the windows APIs?
Same for converting back directory listings as such, you get widechar,
and convert back to UTF8.
All you would need to do is implement that conversion in System/Win32,
I think MultiByteToWideChar supports UTF8, doesn't it?
   

I would think the most efficient approach is to use UTF-16 (i.e. wchar_t) internally on Windows
(otherwise UTF-8). Then if a path is used multiple times, no conversion takes place. The conversion
takes place only at creation time, when you create a path from UTF-8.

Even if you have reasons not to use it, you should have a look at
www.boost.org/doc/libs/1_45_0/libs/filesystem/v3/doc/index.htm
www.boost.org/doc/libs/1_45_0/libs/filesystem/v3/doc/v3_design.html

-Jochen

Can't you just store filenames as UTF8 (like you do on Linux), and
convert UTF8 to widechar just when calling the windows APIs?
Same for converting back directory listings as such, you get widechar,
and convert back to UTF8.
All you would need to do is implement that conversion in System/Win32,
I think MultiByteToWideChar supports UTF8, doesn't it?

I would think the most efficient approach is to use UTF-16 (i.e. wchar_t)
internally on Windows (otherwise UTF-8). Then if a path is used multiple
times, no conversion takes place. The conversion takes place only at
creation time, when you create a path from UTF-8.

The current API is stateless, meaning that the user is responsible for
the storage and format of paths. Thus there is no internal storage.
However, we could cache the conversion using a thread-local, limited-size
LRU cache, depending on how long the conversion takes. Storing
strings as UTF-16 would require converting them to UTF-8 whenever the
client wanted to look at them, incurring lots of memory allocations
and copying.
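
A rough sketch of such a limited-size LRU cache, assuming the UTF-8 path is the key and the converted UTF-16 string is the value. The class name ConversionCache and its interface are hypothetical, and the actual conversion is passed in as a callback so the sketch stays independent of how it is done.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: remembers the most recently used conversions and
// drops the least recently used one when the capacity is exceeded.
class ConversionCache {
public:
  using Converter = std::function<std::u16string(const std::string &)>;

  ConversionCache(std::size_t capacity, Converter convert)
      : capacity_(capacity == 0 ? 1 : capacity), convert_(std::move(convert)) {}

  // Returns the cached conversion, calling the converter only on a miss.
  // The reference stays valid until the entry is evicted by a later get().
  const std::u16string &get(const std::string &utf8) {
    auto it = index_.find(utf8);
    if (it != index_.end()) {
      // Hit: move the entry to the front, marking it most recently used.
      entries_.splice(entries_.begin(), entries_, it->second);
      return it->second->second;
    }
    if (entries_.size() == capacity_) {
      // Full: evict the least recently used entry at the back.
      index_.erase(entries_.back().first);
      entries_.pop_back();
    }
    entries_.emplace_front(utf8, convert_(utf8));
    index_[utf8] = entries_.begin();
    return entries_.front().second;
  }

  std::size_t size() const { return entries_.size(); }

private:
  using Entry = std::pair<std::string, std::u16string>;
  std::size_t capacity_;
  Converter convert_;
  std::list<Entry> entries_; // front = most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

Declaring one instance as thread_local inside the conversion function would give each thread its own cache and avoid locking. Whether the bookkeeping is cheaper than simply redoing the conversion would need to be measured.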

Even if you have reasons not to use it, you should have a look at
www.boost.org/doc/libs/1_45_0/libs/filesystem/v3/doc/index.htm
www.boost.org/doc/libs/1_45_0/libs/filesystem/v3/doc/v3_design.html

-Jochen

My design is based directly on that.

- Michael Spencer

I would think the most efficient approach is to use UTF-16 (i.e. wchar_t)
internally on Windows (otherwise UTF-8). Then if a path is used multiple
times, no conversion takes place. The conversion takes place only at
creation time, when you create a path from UTF-8.

The current API is stateless, meaning that the user is responsible for
the storage and format of paths. Thus there is no internal storage.
However, we could cache the conversion using a thread-local, limited-size
LRU cache, depending on how long the conversion takes. Storing
strings as UTF-16 would require converting them to UTF-8 whenever the
client wanted to look at them, incurring lots of memory allocations
and copying.

But an implementation of your API still has to be provided for LLVM?
As the conversion problem exists only on Windows, I would not try to optimize it, and I doubt it
will be a performance bottleneck. And since you don't use the Boost path, my only wish is that
passing a Boost path should be easy, e.g. by allowing const wchar_t* on Windows.
Given that, UTF-8 as the internal format should be OK too.

-Jochen