On z/OS, there is a need to convert strings from UTF-8 to EBCDIC and vice versa. Namely:
All symbols in the GOFF binary file must be encoded in EBCDIC-1047.
Since LLVM uses UTF-8, a conversion is needed. This conversion must always be available.
Clang source files can be in an arbitrary encoding, and must be translated to UTF-8.
Furthermore, aside from the z/OS target, a charset conversion facility is needed to implement an -finput-charset= option like GCC's.
We think it is best that this functionality reside inside LLVM, because it is used by the SystemZ backend, the GOFF object writer/reader, and the Clang front-end. Furthermore, there is potential for it to be used by other front-ends such as Flang in the future.
Specification:
To implement the use cases, a function is required which can convert a charset encoding to UTF-8 and vice versa.
A special requirement when converting to/from EBCDIC-1047 is that the new line character 0x15 (1047) is mapped to the Unicode character \u000A and vice versa. The standard mapping implemented in tools like iconv instead maps the 1047 new line character to the Unicode character \u0085. Files converted with that mapping result in errors when read by clang.
Our proposed solution is to add a new class CharSetConverter to perform the conversion. The basic idea is to
(1) use the POSIX iconv() function for the conversion. Depending on the platform, this might create a new external dependency.
(2) implement the EBCDIC-1047 ↔ UTF-8 conversion based on a table, in order to fix the NL problem and to make sure the conversion is always available regardless of target. Since this is used in the SystemZ backend, it ensures that the SystemZ backend can always be built on any host. A rough sketch of this table-based path follows.
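To illustrate the table-based path (the function names below and the 256-entry tables are purely illustrative, not the contents of the actual patch): since EBCDIC-1047 covers exactly the Latin-1 repertoire, a byte-indexed table plus a trivial UTF-8 encoding step suffices, and the NL ↔ LF fix is simply baked into the table.

```cpp
#include <cstdint>
#include <string>

// Hypothetical 256-entry tables (contents elided). Each entry maps a byte in
// one encoding to the corresponding Latin-1 code point / EBCDIC byte.
// Crucially, EBCDIC1047ToLatin1[0x15] == 0x0A and Latin1ToEBCDIC1047[0x0A]
// == 0x15, i.e. the EBCDIC NL character round-trips with LF, unlike the
// default iconv tables.
extern const uint8_t EBCDIC1047ToLatin1[256];
extern const uint8_t Latin1ToEBCDIC1047[256];

// Convert EBCDIC-1047 bytes to UTF-8. Every 1047 byte maps to a code point
// in U+0000..U+00FF, so at most two UTF-8 bytes are produced per input byte.
std::string convertEBCDIC1047ToUTF8(const std::string &Input) {
  std::string Result;
  Result.reserve(Input.size());
  for (unsigned char C : Input) {
    uint8_t U = EBCDIC1047ToLatin1[C];
    if (U < 0x80) {
      Result.push_back(static_cast<char>(U));
    } else {
      // Encode U+0080..U+00FF as a two-byte UTF-8 sequence.
      Result.push_back(static_cast<char>(0xC0 | (U >> 6)));
      Result.push_back(static_cast<char>(0x80 | (U & 0x3F)));
    }
  }
  return Result;
}
```

The reverse direction is symmetric: decode a one- or two-byte UTF-8 sequence, reject code points above U+00FF, and index Latin1ToEBCDIC1047.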
Use of iconv() function:
iconv() is part of POSIX.1-2017. On Linux, the iconv() functions are part of glibc. It is reasonable to assume that the use of that function is covered by the same license as other functions from glibc, e.g. strcpy(), which is also part of the POSIX.1-2017 standard. On MacOS, an implementation of the iconv() functionality is distributed with the system. Windows does not come with an iconv() library. This is sadly the same situation as with zlib, libxml2 and so on.
Several implementations exist, but quality and licensing may vary. An implementation for Windows could be based on functions such as MultiByteToWideChar() provided by the WinAPI.
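For reference, the POSIX interface being discussed is small: a conversion descriptor is opened with the target and source encoding names, fed buffers, and closed. A rough sketch is below; error handling is trimmed, and note that the constness of iconv()'s second argument varies slightly between implementations.

```cpp
#include <iconv.h>
#include <string>

// Rough sketch of driving POSIX iconv(). Real code must handle E2BIG by
// growing the output buffer, and EINVAL/EILSEQ for truncated or invalid
// input, instead of giving up.
bool convertWithIconv(const std::string &From, const std::string &To,
                      const std::string &Input, std::string &Output) {
  iconv_t CD = iconv_open(To.c_str(), From.c_str());
  if (CD == (iconv_t)-1)
    return false; // Conversion not supported by this iconv implementation.

  std::string Buffer(Input.size() * 4 + 16, '\0'); // Generous worst case.
  char *In = const_cast<char *>(Input.data());
  size_t InLeft = Input.size();
  char *Out = &Buffer[0];
  size_t OutLeft = Buffer.size();

  size_t Ret = iconv(CD, &In, &InLeft, &Out, &OutLeft);
  iconv_close(CD);
  if (Ret == (size_t)-1)
    return false;
  Output.assign(Buffer.data(), Buffer.size() - OutLeft);
  return true;
}
```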
Alternatives:
Other conversion libraries exist, for example ICU. One drawback is that those libraries are not installed by default on all platforms, including Linux. It also raises the question of licensing. In contrast, glibc is already used by clang/LLVM.
I have no problem with this part. EBCDIC is a simple encoding, and adding support for it does not add undue burden to the project.
I do have a problem with this part. Can you please explain in more detail how this relates to z/OS and why this is not an unrelated “support arbitrary source file encodings” feature?
Because I would certainly prefer to just not have that feature, both from a philosophical and technical perspective:
I don’t think it’s the compiler’s job to convert charsets. You should convert your files to use UTF-8 on disk instead. I especially don’t see why we would need to add such a feature now in 2023 when non-UTF-8 encodings are increasingly irrelevant.
Arbitrary encoding conversion requires a third party library, which means that it must be an optional dependency. This means that availability of this feature would be inconsistent in practice. (This is already the case for things like compression support—we can do that, it’s just not great.)
As you mention, iconv quality varies a lot between implementations. The iconv you get on GNU Linux, and the iconv you get on something like Alpine or MacOS are entirely different things that happen to share a name. We made the mistake of exposing iconv bindings in PHP, and the many behavioral differences between iconv implementations were always a big problem. Our stance on this basically ended up being that if you want a working iconv, you need to install the GNU libiconv (e.g. from homebrew) and link against that instead of whatever the system provides.
Like I said on the previous RFC, host-specific differences inevitably cause problems. Because of that, I think conversion using iconv is a bad choice.
I’d prefer to separate “support EBCDIC-1047 encoding” from “support arbitrary encodings”. The use-cases are clearly separate, and as noted in the RFC the required “EBCDIC-1047” encoding doesn’t even exist in commonly used encoding conversion libraries.
I agree with other commenters that there are multiple considerations here.
One such consideration is support for the xlC #pragma filetag directive. Support for this directive would presumably be important to facilitate inclusion of system headers that are in an encoding that differs from the primary source file. I’d like to know what the plans are for such support.
I don’t think there is one conversion library that will work on all the systems that we need to support at present. ICU would be a good candidate but it hasn’t been maintained on z/OS for quite some time; see [ICU-21672] - Unicode Consortium. Perhaps that could change when/if Clang’s support for z/OS becomes sufficiently mature.
In the meantime, I think it would be reasonable to support multiple conversion libraries wrapped by an in-repo library. That library could be built to use an external library if available (ICU or Microsoft’s conversion facilities on Windows, ICU or iconv on POSIX systems, iconv on z/OS) and provide its own support for a small number of encodings (e.g., EBCDIC-1047). This would give vendors choices in how they build and package their Clang distribution (with all the associated benefits and burdens that brings). Perhaps the biggest downside of such an approach is that the set of recognized encoding names would differ from one distribution to another. That could perhaps be mitigated by defining the set of supported names in the in-repo library (likely following the IANA registered names) and mapping them to names as known by the various conversion libraries. I’ve worked on projects that used a similar approach in the past.
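To illustrate the kind of in-repo wrapper being suggested (all names and macros here are hypothetical, not existing LLVM APIs), the library could normalize an IANA-style encoding name and then dispatch to whichever backend the distribution was built with:

```cpp
#include <cctype>
#include <string>

// Hypothetical dispatch layer: the set of encoding names accepted is defined
// here (roughly following the IANA registry), independent of which
// conversion backend this copy of LLVM was built with.
enum class Backend { Builtin, ICU, Iconv, Unsupported };

// Lower-case and drop '-'/'_' so "IBM-1047", "ibm_1047" and "IBM1047" compare
// equal; a real implementation would follow proper charset-alias matching.
static std::string normalizeName(const std::string &Name) {
  std::string Out;
  for (char C : Name)
    if (C != '-' && C != '_')
      Out.push_back(
          static_cast<char>(std::tolower(static_cast<unsigned char>(C))));
  return Out;
}

static Backend selectBackend(const std::string &Name) {
  // EBCDIC-1047 is always handled by built-in tables, so the SystemZ backend
  // and the GOFF writer never depend on an external library.
  if (normalizeName(Name) == "ibm1047")
    return Backend::Builtin;
#if defined(HAVE_ICU) // Hypothetical configure-time macros.
  return Backend::ICU;
#elif defined(HAVE_ICONV)
  return Backend::Iconv;
#else
  return Backend::Unsupported;
#endif
}
```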
Unfortunately, that isn’t a feasible approach for most developers working on z/OS systems, at least not in my experience.
Thank you for putting together this RFC! I have some high-level thoughts:
I disagree with the philosophical position @nikic has, though I think he definitely has valid points. On the one hand, yes, it would be nice if the user fed us files in the encoding we prefer. But the job of the compiler is to accept as much (reasonable) code as it can. This is why we add so many extensions, especially ones from other compilers like GCC and MSVC, or downgrade some constraint violations in the language spec to be non-fatal warnings in the implementation. From that perspective, I think it makes a lot of sense for Clang to accept source from arbitrary encodings. Further, there are reasonable situations in which the user may not be able to change their source encoding. As an example: a large, older code base built with an out-of-service compiler whose users want to transition to a modern compiler. If the older compiler is also opinionated about file encodings, the user will have a harder time picking Clang as their replacement compiler if Clang is differently-opinionated on the encoding. Also, non-UTF-8 encodings are hardly uncommon; Windows is commonly a mixture of Windows Latin 1 (or some other system code page) and UTF-16, and is an important host platform for Clang to support well.
Adding new shared library dependencies to Clang makes deployment harder unless Clang degrades gracefully in the absence of the shared library. However, statically linking a text encoding library to make deployment easier is likely to add significant binary size overhead that could be problematic. I think we could get reasonable behavior by using a shared library and if the compiler cannot load the library or there is a versioning issue, we issue a diagnostic about being unable to support that encoding and pointing the user to our copious documentation that hand-holds them through the process of getting the right library installed for their system.
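One way to get that graceful degradation (purely illustrative; none of these names exist in LLVM today) is to resolve the conversion library lazily, so a missing or mismatched library surfaces as a diagnostic rather than a loader failure at compiler startup. A minimal POSIX-flavored sketch, assuming an iconv-style library; Windows would use LoadLibrary/GetProcAddress instead:

```cpp
#include <dlfcn.h>
#include <string>

// Illustrative only: lazily load an iconv-style shared library and report a
// user-facing error if it is absent or incompatible.
struct ConversionLibrary {
  void *Handle = nullptr;
  void *(*OpenFn)(const char *, const char *) = nullptr; // simplified type

  bool load(std::string &Error) {
    Handle = dlopen("libiconv.so.2", RTLD_NOW); // Library name varies by OS.
    if (!Handle) {
      Error = "conversion library not found; only UTF-8 (and the built-in "
              "IBM-1047 tables) are available in this configuration";
      return false;
    }
    OpenFn = reinterpret_cast<void *(*)(const char *, const char *)>(
        dlsym(Handle, "iconv_open"));
    if (!OpenFn) {
      Error = "conversion library is present but incompatible";
      return false;
    }
    return true;
  }
};
```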
I agree with the concerns that iconv isn’t up to the task due to cross platform behavioral differences. It’s also not readily available on Windows. As I understand things, ICU is perhaps a more viable alternative, but I leave it to folks who know this landscape better than I do to make suggestions on what the most appropriate library would be for us if we accept this RFC. To me, the choice of library boils down to: availability on platforms we host Clang on, license for the library, user experience (binary size with a static lib vs all the problems that come from shared library use vs not having the functionality at all).
Overall, I support the idea of adding some sort of text encoding conversion functionality that is used by Clang so that we can accept more source files. Whether there’s a good, practical way to accomplish that still seems like a bit of an open question though.
In a previous life, I’ve had the (mis)pleasure of having to work with charset support. As much as I would like to, I don’t think I can endorse Nikita’s proposition that we only accept UTF-8 input. But the support of charsets is a tricky proposition, as non-Unicode charsets raise the possibility of untranslatable strings, and non-ASCII-compatible charsets raise all sorts of pain points: I can no more endorse a proposal that blindly wants to support every charset in ICU (itself a somewhat underdefined concept!).
It would be helpful, I think, to understand what the precise needs of non-UTF-8 source files are. When you have a non-UTF-8 source file, what should the behavior be for characters in identifiers (especially those that need to be emitted in output binaries)? For characters in “” or L"" strings (how about for \x or \u escapes in strings)? If we can say that for ASCII-compatible, non-UTF-8 source files, identifiers and strings need to be pure ASCII, then there is no need to do any actual conversion between, say, Windows-1252 and UTF-8.
Note that EBCDIC, which prompted this RFC, is non-ASCII-compatible, which creates entirely different sets of concerns. I think it is very reasonable that EBCDIC be handled differently from Windows-1252 or GB18030.
Thanks for making this RFC, sorry I didn’t see it sooner.
For the purpose of GOFF files, I'm happy with an EBCDIC-specific code path; it seems reasonable and, as @nikic observed, it's a separate feature in a way.
I would like to know, however, whether you considered making the GOFF file format support in clang recognize \025?
Not knowing anything about GOFF, it would make sense to me that it wouldn't trip on various line endings, if that is at all doable.
For source conversion, I agree with others that we need a solution that:
Supports a wide range of non-UTF-8 code bases
Supports a wide range of platforms, recognizing that Microsoft platforms are often non-UTF-8
Is relatively consistent across platforms
Does not unreasonably complicate packaging and deployment due to binary size or licensing concerns.
Given that iconv is not easily available on Windows, and people have raised concerns about the inconsistencies iconv may have on different platforms, I would second investigating ICU4C.
I would encourage some research into the deployment challenges that ICU4C might present.
Do we target platforms where it would not be available as a package?
If so, can we compile it as a static library in such a way that it does not add a large binary size
(most of the data ICU is built with by default is not relevant to our use cases)?
I think these questions need to be answered before we can pursue supporting other source file encodings
in clang, but I don’t think that should block work on GOFF or the backend.
I am wondering what the real minimal requirements are. LLVM is extremely picky about adding external dependencies. If these really are the minimal requirements, can you fulfil them inside of LLVM without an external dependency?
I have absolutely no concerns in that regard.
The C++ committee spent the last few years resolving all of these concerns.
Supporting other encodings as source files means converting them to UTF-8 (lexing and everything after that stays UTF-8). We established that all characters you could possibly want to use in a C++ source file have a representation in Unicode (C++ has absolutely no wiggle room to support data that is not representable in Unicode). The only thing that may require some thought is how we map control characters in EBCDIC, but the proposed PR already has an answer for that.
Same thing for the encoding of string literals: either the conversion succeeds or the program is ill-formed.
GCC has been using iconv and a similar architecture for a long while.
Most of the cost of the feature is in maintaining the conversion mechanism; the impact on Clang itself should be well manageable.
While I don’t have hard data about the current state of things, I’m confident that in the past, at least some Sony studios building PS4/PS5 games have used Shift-JIS environments. I know we have tests where we have to set up a Japanese Windows environment to be sure that works.
It feels both easier and "more correct" to just do the translation properly. If iconv mis-translates certain control chars, and we make our own table that does the translation correctly, then we don't have to worry about having Clang recognize anything.
Yes, specifically for this use-case, we don’t need any external dependency. EBCDIC-1047 supports all the legal characters for symbol names, so we can just use a pair of translation tables between EBCDIC and UTF-8 (really, EBCDIC and LATIN-1).
As mentioned above, ICU has its own holes in support (namely for EBCDIC), but we could work around this simply by preferring different conversion tools on different platforms (e.g., iconv on z/OS, ICU on Windows). Would this be an acceptable solution to iconv's unavailability on Windows?
My primary concern with that approach is how compatible the results are across libraries. e.g., it would be really bad if the same input text and same target encoding produced different results depending on what platform the host compiler was on. However, I’m not certain how likely such an outcome is.
But in general, it’s not a problem using different libraries on different platforms. We have a reasonable number of APIs in llvm/Support that have a UNIX branch and a Windows branch that use different implementation strategies for platform-specific needs. Text encoding isn’t quite as platform-specific as threads or file system support, but it is tied somewhat to the platform in terms of what library offerings are available.
I have put up two patches related to this RFC: one to add a CharSetConverter wrapper class for ConverterEBCDIC (⚙ D153417 New CharSetConverter wrapper class for ConverterEBCDIC), which my -fexec-charset implementation relies on, and a separate patch to add iconv support to this class (⚙ D153418 Adding iconv support to CharSetConverter class), which can probably be expanded to support other conversion libraries like ICU, which I saw was mentioned in this discussion. Any feedback on how I can improve this implementation is appreciated. Thanks!