Is the plan to put in the documentation something like the following?
“The supported encodings, and the names of those encodings, depend on what operating system you’re using, and how clang was packaged. If clang was configured to use iconv(), you can use iconv -l to list the supported encoding names. If clang was configured to use ICU, see ICU documentation (link). On Windows, see Microsoft documentation (link). If encoding support is disabled, only ‘utf-8’ is supported. Note that a given encoding name may refer to different encodings on different hosts. If you’re cross-compiling, the encodings you need are probably not supported.”
This seems unfriendly at best, but if there’s no existing portable library we can reasonably use, the only other alternative is to implement the encodings ourselves, I guess. (Implementing encodings is pretty easy for single-byte encodings, since it can be purely table-driven. Maybe a bit painful if we need to support East Asian encodings.)
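For concreteness, a table-driven single-byte converter really is tiny. Here is a minimal Python sketch (the table fragment below is a hypothetical illustration, not a real encoding definition; a real implementation would ship full 256-entry tables per encoding):

```python
# Sketch of a table-driven single-byte converter: each supported encoding
# is just a 256-entry table mapping byte values to Unicode code points,
# and the encode direction is the inverted table.

# Hypothetical table fragment: identity mapping for the ASCII range
# plus two Latin-1-style entries.
DECODE_TABLE = {i: chr(i) for i in range(0x80)}
DECODE_TABLE[0xE9] = "\u00e9"  # é, as in ISO-8859-1
DECODE_TABLE[0xFC] = "\u00fc"  # ü, as in ISO-8859-1

ENCODE_TABLE = {ch: byte for byte, ch in DECODE_TABLE.items()}

def encode(text: str) -> bytes:
    # Strict mode: an unmappable character is an error, not a best-fit guess.
    try:
        return bytes(ENCODE_TABLE[ch] for ch in text)
    except KeyError:
        raise UnicodeEncodeError("sketch", text, 0, 1, "unmappable character")

def decode(data: bytes) -> str:
    return "".join(DECODE_TABLE[b] for b in data)
```

East Asian encodings are where this gets painful: multi-byte sequences mean the tables become state machines rather than flat arrays.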
Does anyone have a list of specific encodings they/their customers care about?
And then we would not be portable with anyone else. It’s also a massive amount of work (despite not being hard).
In practice, both ICU and iconv use similar names for similar encodings, and their outputs are bytewise identical except for a few code points in the handful of encodings for which multiple transformations are equally valid.
The set of supported encodings will vary, but given that the use case is to produce the encoding of the platform on which the program will run, the only scenario likely not to work is someone on z/OS trying to compile a program intended to run on Windows (and even that depends on what z/OS does to support iconv).
@abhina-sree In the hope that we reach some conclusion here, are you willing to investigate ICU and its behavior on Windows? We might need to open it dynamically on Windows, as we have to continue to support older Windows versions with the same binary.
The plan would be roughly: write a converter using ICU, investigate packaging and build, and then add -fexec-charset and deal with the fallout of having multiple encodings in clang, as I’m sure lots of small details will surface.
I do not see a reason why we could not have an additional iconv backend when ICU is not available, as long as it does not negatively impact deployment and packaging.
But before investing a lot of resources in a polished PR set, I think it would be enough for you to come back with answers to the questions asked in this thread in terms of feasibility and complexity. It’s important that the legal questions related to the use of third-party libraries get a clear answer too.
Does that sound good and reasonable to you? Thanks
I like the proposed plan. However, I admit I have never used ICU before and I may not be the most knowledgeable person to do this investigation, nor do I have a Windows environment set up to test, so if anyone is willing to assist with this, I’d be grateful for the help. If not, I’ll try to do this investigation myself and give an update on this RFC when I can.
For the legal concerns: depending on the version, ICU is covered by either the Unicode licence or the ICU licence (which is stated to be compatible with the GNU GPL). iconv is also LGPL. So I don’t think that should be a problem for us.
CCing some folks on the LLVM Foundation board for awareness of legal exposures they may want to weigh in on: @tonic @beanz @rnk @akorobeynikov (Btw, do we have a better way to tag RFCs for “needs legal review”?)
I just heard privately that sending an email to firstname.lastname@example.org is the way to go for licensing questions, so I’ve taken the liberty of forwarding this thread over with a summary of what the question is. Specifically, I asked (and CCed @abhina-sree):
We currently have no plan or resources allocated towards porting ICU on z/OS. Our users also rely on iconv for the system locales, but (and please correct me if I’m wrong) it seems like ICU does not use system locales so this may not meet our users’ needs. So we would still prefer to have iconv support available at the very least for z/OS, even if ICU is the preferred default.
An optional, dynamic link dependency to ICU is fine, but a static or required dynamic link dependency on ICU would impact LLVM licensing due to ICU’s attribution requirement and is not permissible.
Use of iconv() is fine if it’s available on the system but distribution of an iconv implementation would have other licensing ramifications.
Technical discussion was that a good path forward is for LLVM to provide a generic interface that can be implemented internally with any of ICU/iconv/MultiByteToWideChar/etc with selection(s) made at compile time. The interface will need to fail gracefully in the case that an implementation is not available at runtime.
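The interface described above might look roughly like the following sketch (Python, with the stdlib codecs module standing in for a real ICU/iconv/MultiByteToWideChar backend; all names are hypothetical):

```python
import codecs

class ConverterUnavailableError(Exception):
    """Graceful failure: no configured backend supports the encoding."""

class CodecsBackend:
    """Stand-in for an ICU/iconv/MultiByteToWideChar backend, using
    Python's codecs module so the sketch is runnable."""

    def available(self, encoding: str) -> bool:
        try:
            codecs.lookup(encoding)
            return True
        except LookupError:
            return False

    def convert(self, text: str, encoding: str) -> bytes:
        return text.encode(encoding)

class CharSetConverter:
    """Generic interface: backends are chosen at build time and probed
    at runtime, failing gracefully if none supports the encoding."""

    def __init__(self, backends):
        self.backends = backends

    def convert(self, text: str, encoding: str) -> bytes:
        for backend in self.backends:
            if backend.available(encoding):
                return backend.convert(text, encoding)
        raise ConverterUnavailableError(encoding)

converter = CharSetConverter([CodecsBackend()])
```

The key property is the last branch: an unsupported encoding produces a reportable error rather than a crash, which is what lets a build without ICU/iconv still ship and diagnose cleanly.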
@abhina-sree, once -fexec-charset is enabled, how will conflicts in libraries be managed? For example, regular expression parsing would require a conception of the [, ], ^, |, and \ characters for syntax like [^0-9]|\\. These characters are not encoded in the same manner across all EBCDIC encodings.
I’m not sure I understand your question. Are you afraid these libraries use integer values to represent characters? Otherwise, all string and character literals are going to be encoded in EBCDIC, so it should just work as long as the characters are representable.
It’s likely that some libraries which expect, for example, ‘A’–‘Z’ to be a contiguous range, or which otherwise depend on the specific ordering of ASCII in some way, will not be compatible with some encodings, but that’s already the case with, e.g., GCC.
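The variant-character problem Hubert raises can be shown concretely with two EBCDIC code pages Python happens to ship (cp037 and cp500 as stand-ins; these are an illustration, not the exact set of z/OS encodings in question):

```python
# The regex metacharacters [, ], ^ and | are "variant" EBCDIC characters:
# their byte values differ between code pages that otherwise agree.
variant = "[]^|"
variant_cp037 = variant.encode("cp037")
variant_cp500 = variant.encode("cp500")

# Letters and digits are invariant: identical bytes in both code pages.
invariant = "AZaz09"
invariant_same = invariant.encode("cp037") == invariant.encode("cp500")
```

So a byte string baked into a compiled library under one EBCDIC code page will not match the same source characters compiled under another.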
When included from TUs that use a different literal encoding, the characters used in the regular expression may be interpreted differently by the std::regex constructor depending on how symbols end up getting linked or loaded.
The encoding is part of the ABI; they would need a different set of compiled libraries, including a differently compiled standard library, and maybe a way to detect incompatibilities. I’m not sure there are many headers that could be reused without recompiling.
But I don’t think it was ever a goal to be able to run programs on systems which have a different basic execution character set.
At the same time, I’m not sure we should warn on it, because cross-compiling should work.
Hubert, can you describe how the traditional xlC compiler handles these cases? Or was it less of an issue because problematic libraries like <regex> and <format> that rely on these characters were not provided?
The only solution I’m so far able to conceive of involves (benign) ODR violations and the library headers providing different interfaces based on the active literal encoding.
A TR1 implementation of <regex> was provided, but I am not sure how much weight to put into what it does. From inspection, the shared library portion was free of characters in literals with variant encodings. The header goes with the literal encoding in effect at the point it was included. There does not appear to be any symbol versioning based on encoding. This means that the variant characters will match in the natural manner as long as the user does not use <regex> with more than one literal encoding in the same program.
A few things I can think of for libc++, which aren’t mentioned yet.
The <locale> parsers are in the dylib. Parts of the locale handling are done by the underlying C library; for example, we use strftime_l in libc++.
The <print> functions should be able to convert the input to Unicode for stdout, if it supports Unicode. When I implemented this, Clang always used UTF-8. We probably need to do something there too.
Another assumption libc++ makes is that wchar_t('a') == L'a' is true. With this compiler flag, are there cases where that assumption no longer holds?
There might be places that assume ‘a’–‘z’ is a contiguous range, which is true for ASCII but false for EBCDIC.
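This is easy to check with Python’s cp037 codec as a stand-in for an EBCDIC target:

```python
# In ASCII, 'a'..'z' occupy 26 consecutive code units. In EBCDIC the
# lowercase letters fall into three separate runs (a-i, j-r, s-z), so
# arithmetic like ch - 'a' and range checks like 'a' <= ch <= 'z' break.
letters = "abcdefghijklmnopqrstuvwxyz"

ascii_codes = [ord(c) for c in letters]
ascii_contiguous = all(b - a == 1
                       for a, b in zip(ascii_codes, ascii_codes[1:]))

ebcdic_codes = list(letters.encode("cp037"))
ebcdic_contiguous = all(b - a == 1
                        for a, b in zip(ebcdic_codes, ebcdic_codes[1:]))
```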
In general, in libc++ we only want to support features we have a buildbot for. We already use some AIX buildbots provided by IBM. I don’t know what needs to be done to test the character sets @abhina-sree mentioned. For new buildbots, we don’t object to adding an XFAIL; that at least gives an indication of what needs to be fixed to properly support this compiler option.
Would a case where the wide literal encoding is UTF-32 and the ordinary literal encoding is “Shift_JIS” count as a case where “that assumption” doesn’t hold true (for the half-width katakana characters)? Or was the question specifically about characters in the basic character set?
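The half-width katakana case can be made concrete (using Python’s codecs as a stand-in for the compiler’s literal encoding):

```python
# In Shift_JIS, HALFWIDTH KATAKANA LETTER A is a single byte (0xB1),
# but its Unicode scalar value is U+FF71. So with a Shift_JIS ordinary
# literal encoding and a UTF-32 wide encoding, the narrow character
# value and the wide character value for the same character disagree.
ch = "\uff71"                    # HALFWIDTH KATAKANA LETTER A
narrow = ch.encode("shift_jis")  # one byte in Shift_JIS
wide = ord(ch)                   # UTF-32 code unit
mismatch = narrow[0] != wide     # wchar_t(c) == L'c' fails here
```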