I’m not sure I understand. If the source file is supposed to be UTF-8 but is not valid UTF-8, it must be rejected in translation phase 1. Or what did you mean by “utf-8 strings”?
Is there an issue in doing concatenation and translation at the same time? (This is what we’ve been doing so far.)
“\x80”, for example, can appear in a source file and is valid even at translation phase 5, regardless of the source encoding, when the execution character set is UTF-8 (yet it is clearly not valid UTF-8 by itself). That does not prevent us from checking format-string validity, emitting diagnostics, etc.
Consider "e" "\N{ACUTE ACCENT}" - lets say the execution encoding is latin 1. ACUTE ACCENT is not representable in latin 1, but the sequence e, ACUTE ACCENT is. So the order of operation matters.
But I’ve been thinking about that overnight. It would add a bit of complexity for something that is mostly theoretical, and something that neither GCC nor iconv supports (I’ll have to check what ICU does). So… maybe we can ignore it!
Thanks, now I get it.
Suppose we have something like printf("\x80" "%llx\n", (int)var).
In order to properly analyze the format string, we need to simulate the behavior of the real printf. It is easier to do so if the format string has already been translated, but in order to emit a good diagnostic we need to locate the ‘%’, which is only possible if we have the original literal(s) at hand. This is becoming more complicated than I thought…
Is that right? If \N{ACUTE ACCENT} is not a member of the execution charset, it should be replaced with a “replacement character”, e.g. ‘?’. Whether or not the concatenation has taken place seems irrelevant. Unless I misunderstand it, that is.
The existing code to handle this is in StringLiteral::getLocationOfByte; it re-parses the string literal to figure out the correct source location. Adapting it to deal with -fexec-charset correctly shouldn’t be that bad; I mean, actually counting bytes correctly is a bit tricky, but all the necessary information from the source code is easily accessible.
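For context, something like the following is the shape of the call I have in mind (a rough, untested sketch; the signature is paraphrased from memory and the optional out-parameters are omitted):

```cpp
#include "clang/AST/Expr.h"
#include "clang/Basic/LangOptions.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Basic/TargetInfo.h"

// Map the index of an offending byte inside the format string back to a
// SourceLocation in the original (possibly concatenated) literal, so a
// diagnostic can point at the right '%'. With -fexec-charset, the byte
// counting done inside getLocationOfByte is exactly where transcoded
// character widths would have to be taken into account.
static clang::SourceLocation
locateFormatByte(const clang::StringLiteral *FormatStr, unsigned ByteNo,
                 const clang::SourceManager &SM,
                 const clang::LangOptions &LangOpts,
                 const clang::TargetInfo &Target) {
  return FormatStr->getLocationOfByte(ByteNo, SM, LangOpts, Target);
}
```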
We don’t do NFC normalization normally; doing it when -fexec-charset is enabled would be confusing. If the user wants a latin-1 “e-with-acute-accent”, they should use the corresponding Unicode “e-with-acute-accent” character (U+00E9). (Unicode defines such a combined character for every character that exists in standardized non-Unicode encodings.)
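A minimal illustration of that recommendation, assuming the proposed -fexec-charset=latin1 spelling (the printed sizes depend on whichever execution charset is actually in effect):

```cpp
#include <cstdio>

int main() {
  // Precomposed U+00E9: under a Latin-1 execution charset this becomes the
  // single byte 0xE9 (plus the terminating NUL).
  const char precomposed[] = "\u00E9";
  // Decomposed 'e' + U+0301 (combining acute): U+0301 has no Latin-1
  // equivalent, so without NFC normalization it cannot be represented.
  const char decomposed[] = "e\u0301";
  std::printf("%zu %zu\n", sizeof precomposed, sizeof decomposed);
}
```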
I thought that maybe ICU would replace grapheme clusters when transcoding, but that’s not the case. Given that nobody else does it, it’s not a problem we have to care about, and transcoding before concatenation may be fine after all, which certainly makes things easier!
By system charset, I was referring to the default encoding on the platform. For z/OS, the default is IBM-1047. Should it be system encoding instead?
And yes, I think it was also mentioned in “[RFC] Adding a CharSet Converter to the LLVM Support Library” that we wish to support non-UTF-8 source files as well.
I agree, this is something that is important to resolve. iconv is also POSIX-compliant and GCC’s -fexec-charset uses it, so if Windows is the major platform that requires ICU, could we default to iconv and fall back to ICU on platforms like Windows?
This came up on the other RFC on this topic and I don’t think we came to a resolution there. My questions boil down to:
It seemed from the other discussions that there were concerns with using iconv; should we use ICU as the default, fall back to iconv only when ICU is not available, and gracefully fail if neither is available?
Do iconv and ICU produce the same results for the same inputs? If not, what is our plan to address the differences?
What is the cost of that dependency? Will it be statically linked or dynamically linked? If static, what does it do to binary sizes?
Are there downstream impacts for the dependency that we need to be concerned about?
What licensing concerns (if any) exist?
Is the intent to use the ICU from Windows 10 when available and fall back to an external ICU? Will we switch to the Windows 10 ICU when we drop support for older versions of Windows? What differences (if any) exist between Windows ICU and the third-party release of ICU?
If we need a different implementation for Windows from other platforms, are the Win32 APIs sufficient for the task so that we don’t need any extra dependencies on Windows?
This sounds like a good plan, although is there strong enough demand for this feature to justify doubling the maintenance and packaging work? We have the internal EBCDIC converter, which should cover z/OS, and ICU should be able to accommodate most POSIX systems and most Windows ones.
ICU produces self-consistent results everywhere (it’s the same library everywhere, and encodings are stable; they don’t really change over time), whereas iconv may produce different results on different platforms. This was mostly a concern for EBCDIC line endings; for other scenarios it’s not something I’m concerned about. Few encodings have duplicates, and when they do, they represent the same abstract/semantic characters. Some distinct encodings share names across platforms, but each usually has at least one name that is unique enough, if that were a concern.
Let’s see how things work on Windows 10 first. Adding a second or third implementation (presumably one based on MultiByteToWideChar) is probably not worth the trouble, because we would most likely have to do quite a lot of manual work; in particular, Windows does not provide an easy way to map encoding names to code page enums.
But generally I agree with you. We should answer these questions before pursuing any pull request.
I don’t think it’s a huge amount of work.
Researching how we can portably link to ICU, maybe talking to a few package maintainers, and seeing how it behaves on Windows seem like good next steps.
Speaking as a packager of LLVM’s Windows binaries and Chromium’s toolchain, any new dependency is typically a fairly large pain. Sometimes they’re worth it, and I think that’s the most important question here.
Another important question is what impact this would have on Clang’s performance.
For z/OS, we wish to support more than just EBCDIC but don’t have ICU support, which is why we also want to add iconv support. But in general, I’m happy with using ICU as the default when it’s available.
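For reference, the POSIX iconv interface such a backend would wrap is quite small; a simplified, untested sketch (buffer management, state flushing, and error reporting are all reduced to the bare minimum):

```cpp
#include <iconv.h>

#include <cerrno>
#include <string>

// Convert Input from FromCode to ToCode using POSIX iconv. Which names
// (e.g. "UTF-8", "IBM-1047") are accepted depends on the host's iconv,
// which is exactly the portability concern raised in this thread.
std::string convertWithIconv(const std::string &Input, const char *FromCode,
                             const char *ToCode) {
  iconv_t CD = iconv_open(ToCode, FromCode);
  if (CD == (iconv_t)-1)
    return {};

  std::string Result;
  char OutBuf[256];
  char *In = const_cast<char *>(Input.data());
  size_t InLeft = Input.size();
  while (InLeft != 0) {
    char *Out = OutBuf;
    size_t OutLeft = sizeof(OutBuf);
    size_t RC = iconv(CD, &In, &InLeft, &Out, &OutLeft);
    Result.append(OutBuf, sizeof(OutBuf) - OutLeft);
    if (RC == (size_t)-1 && errno != E2BIG)
      break; // invalid or unrepresentable input; a real backend diagnoses it
  }
  iconv_close(CD);
  return Result;
}
```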
Is the plan to put in the documentation something like the following?
“The supported encodings, and the names of those encodings, depend on what operating system you’re using, and how clang was packaged. If clang was configured to use iconv(), you can use iconv -l to list the supported encoding names. If clang was configured to use ICU, see ICU documentation (link). On Windows, see Microsoft documentation (link). If encoding support is disabled, only ‘utf-8’ is supported. Note that a given encoding name may refer to different encodings on different hosts. If you’re cross-compiling, the encodings you need are probably not supported.”
This seems unfriendly at best, but if there’s no existing portable library we can reasonably use, the only other alternative is to implement the encodings ourselves, I guess. (Implementing encodings is pretty easy for single-byte encodings, since it can be purely table-driven. Maybe a bit painful if we need to support East Asian encodings.)
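For what it’s worth, the table-driven shape for a single-byte encoding is roughly this (Latin-1 is used only because its table happens to be the identity; any other single-byte encoding would substitute a 256-entry table generated from the published mapping files):

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <optional>

// Decoding table: byte value -> Unicode code point. Latin-1 is the identity
// mapping; other single-byte encodings plug in a generated table here.
static const std::array<char32_t, 256> ToUnicode = [] {
  std::array<char32_t, 256> T{};
  for (unsigned B = 0; B < 256; ++B)
    T[B] = B;
  return T;
}();

// Encoding (Unicode -> byte) is just the reverse lookup, built once.
static std::optional<std::uint8_t> encodeCodePoint(char32_t CP) {
  static const std::map<char32_t, std::uint8_t> Reverse = [] {
    std::map<char32_t, std::uint8_t> M;
    for (unsigned B = 0; B < 256; ++B)
      M.emplace(ToUnicode[B], static_cast<std::uint8_t>(B));
    return M;
  }();
  auto It = Reverse.find(CP);
  if (It == Reverse.end())
    return std::nullopt; // not representable: error out or substitute '?'
  return It->second;
}
```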
Does anyone have a list of specific encodings they/their customers care about?
And then we would not be compatible with anyone else. It’s also a massive amount of work (despite not being hard).
In practice, both ICU and iconv use similar names for similar encodings, and the outputs differ only for a few code points, in the few encodings for which multiple mappings are equally valid.
The set of encodings will vary, but given that the use case is to produce the encoding of the platform where the program will run, the only scenario likely not to work is someone on z/OS trying to compile a program intended to run on Windows (again, that also depends on what z/OS does to support iconv).
@abhina-sree In the hope that we reach some conclusion here, are you willing to investigate ICU and its behavior on Windows? We might need to load it dynamically on Windows, as we have to continue to support other Windows versions with the same binary.
The plan would roughly be: write a converter using ICU, investigate packaging and the build, then add -fexec-charset and deal with the fallout of having multiple encodings in Clang, as I’m sure there are lots of small details that will surface.
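To give an idea of the scope of that first step, the core of an ICU-based converter is small; a rough, untested sketch using ICU’s C API (the one-shot ucnv_convert call here is just for illustration; a real backend would use the streaming API and handle errors and buffer sizing properly):

```cpp
#include <unicode/ucnv.h>

#include <string>
#include <vector>

// Convert the (UTF-8) contents of a string literal into the requested
// execution charset via ICU. Returns an empty string on failure for brevity.
std::string convertFromUTF8(const std::string &Input, const char *ExecCharset) {
  UErrorCode Err = U_ZERO_ERROR;
  // Generous fixed-size output buffer; a real implementation would size
  // and retry based on the returned length.
  std::vector<char> Out(Input.size() * 4 + 16);
  int32_t Len = ucnv_convert(ExecCharset, "UTF-8", Out.data(),
                             static_cast<int32_t>(Out.size()), Input.data(),
                             static_cast<int32_t>(Input.size()), &Err);
  if (U_FAILURE(Err))
    return {};
  return std::string(Out.data(), static_cast<size_t>(Len));
}
```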
I do not see a reason why we could not have an additional iconv backend when ICU is not available, as long as it does not negatively impact deployment and packaging.
But before investing a lot of resources in a polished PR set, I think it would be enough for you to come back with answers to the questions asked in this thread, in terms of feasibility and complexity. It’s important that the legal questions related to the use of third-party libraries get a clear answer too.
Does that sound good and reasonable to you? Thanks
I like the proposed plan. However, I admit I have never used ICU before and I may not be the most knowledgeable person to do this investigation, nor do I have a Windows environment set up to test with, so if anyone is willing to assist with this, I’d be grateful for the help. If not, I’ll try to do the investigation myself and give an update on this RFC when I can.
For the legal concerns: depending on the version, ICU is covered either by the Unicode licence or by the ICU licence (which is stated to be compatible with the GNU GPL). iconv is also LGPL. So I don’t think that should be a problem for us.
CCing some folks on the LLVM Foundation board for awareness of legal exposures they may want to weigh in on: @tonic @beanz @rnk @akorobeynikov (Btw, do we have a better way to tag RFCs for “needs legal review”?)
I just heard privately that sending an email to board@llvm.org is the way to go for licensing questions, so I’ve taken the liberty of forwarding this thread over with a summary of what the question is. Specifically, I asked (and CCed @abhina-sree):
(I didn’t ask about iconv because that interface is part of POSIX and so I think we’re safe there. If folks think there’s a licensing question for iconv to be answered, feel free to chime in!)