RFC: Enabling fexec-charset support to LLVM and clang (Reposting)

My initial thought was this would not be a strong dependency and if the library is not available on the machine, the option should gracefully fail.

It seems like ICU is the preferred default for platforms, but this won’t work for z/OS. Is it acceptable to guard the use of iconv for z/OS only?

Is the plan to put in the documentation something like the following?

“The supported encodings, and the names of those encodings, depend on what operating system you’re using, and how clang was packaged. If clang was configured to use iconv(), you can use iconv -l to list the supported encoding names. If clang was configured to use ICU, see ICU documentation (link). On Windows, see Microsoft documentation (link). If encoding support is disabled, only ‘utf-8’ is supported. Note that a given encoding name may refer to different encodings on different hosts. If you’re cross-compiling, the encodings you need are probably not supported.”

This seems unfriendly at best, but if there’s no existing portable library we can reasonably use, the only other alternative is to implement the encodings ourselves, I guess. (Implementing encodings is pretty easy for single-byte encodings, since it can be purely table-driven. Maybe a bit painful if we need to support East Asian encodings.)
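To illustrate the table-driven approach mentioned above for single-byte encodings, here is a minimal sketch. The names `ByteTable`, `makeAsciiToEbcdicTable`, and `asciiToEbcdic` are hypothetical and not part of LLVM. The table only covers space, digits, and unaccented Latin letters, using their positions in code pages such as IBM-037 and IBM-1047; a real converter would need a complete table per encoding.

```cpp
#include <cstdint>

// Hypothetical 256-entry lookup table: the index is the source byte and the
// value is the converted byte.
struct ByteTable {
  std::uint8_t map[256];
};

constexpr ByteTable makeAsciiToEbcdicTable() {
  ByteTable t{};
  // Default every entry to 0x3F, the EBCDIC substitute (SUB) control character.
  for (int i = 0; i < 256; ++i)
    t.map[i] = 0x3F;
  t.map[0x20] = 0x40; // space
  for (int i = 0; i < 10; ++i) // '0'..'9'
    t.map[0x30 + i] = static_cast<std::uint8_t>(0xF0 + i);
  // EBCDIC letters come in three separate runs: A-I, J-R, and S-Z.
  for (int i = 0; i < 9; ++i) {
    t.map[0x41 + i] = static_cast<std::uint8_t>(0xC1 + i); // 'A'..'I'
    t.map[0x61 + i] = static_cast<std::uint8_t>(0x81 + i); // 'a'..'i'
    t.map[0x4A + i] = static_cast<std::uint8_t>(0xD1 + i); // 'J'..'R'
    t.map[0x6A + i] = static_cast<std::uint8_t>(0x91 + i); // 'j'..'r'
  }
  for (int i = 0; i < 8; ++i) {
    t.map[0x53 + i] = static_cast<std::uint8_t>(0xE2 + i); // 'S'..'Z'
    t.map[0x73 + i] = static_cast<std::uint8_t>(0xA2 + i); // 's'..'z'
  }
  return t;
}

constexpr ByteTable kAsciiToEbcdic = makeAsciiToEbcdicTable();

// Converting one byte is a single array lookup.
constexpr std::uint8_t asciiToEbcdic(std::uint8_t c) {
  return kAsciiToEbcdic.map[c];
}
```

Source bytes are written as numeric ASCII values rather than character literals so that the sketch does not depend on the literal encoding of the compiler building it.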

Does anyone have a list of specific encodings they/their customers care about?

I’d be fine with this, if it’s difficult to get it to find iconv on platforms where ICU is not available.

And then we would not be portable with anyone else. It’s also a massive amount of work (despite not being hard).

In practice, both ICU and iconv will have similar names for similar encodings, and only a few code points, in a few encodings for which multiple transformations are equally valid, are not bytewise identical.
The set of encodings will vary, but given that the use case is to produce the encoding of the platform where the program will be run, the only scenario likely not to work is someone on z/OS trying to compile a program intended to run on Windows (again, that also depends on what z/OS does to support iconv).

@abhina-sree In the hope that we reach some conclusion here, are you willing to investigate ICU and its behavior on Windows? We might need to load it dynamically on Windows, as we have to continue to support other Windows versions with the same binary.

The plan would be, roughly: write a converter using ICU, investigate packaging and build, and then add -fexec-charset and deal with the fallout of having multiple encodings in clang, as I’m sure there are lots of small details that will surface.

I do not see a reason why we could not have an additional iconv backend when ICU is not available, as long as it does not negatively impact deployment and packaging.

But before investing a lot of resources in a polished PR set, I think it would be enough for you to come back with answers to the questions asked in this thread in terms of feasibility and complexity. It’s important that the legal questions related to the use of third-party libraries get a clear answer too.

Does that sound good and reasonable to you? Thanks

I like the proposed plan. However, I admit I have never used ICU before and I may not be the most knowledgeable person to do this investigation, nor do I have a Windows environment set up to test, so if anyone is willing to assist with this, I’d be grateful for the help. If not, I’ll try to do this investigation myself and give an update on this RFC when I can.

For the legal concerns, depending on the version, ICU is covered by either the Unicode licence or the ICU licence (which is stated to be compatible with the GNU GPL). GNU libiconv is licensed under the LGPL. So I don’t think that should be a problem for us.

CCing some folks on the LLVM Foundation board for awareness of legal exposures they may want to weigh in on: @tonic @beanz @rnk @akorobeynikov (Btw, do we have a better way to tag RFCs for “needs legal review”?)

I just heard privately that sending an email to board@llvm.org is the way to go for licensing questions, so I’ve taken the liberty of forwarding this thread over with a summary of what the question is. Specifically, I asked (and CCed @abhina-sree):

Are there any licensing concerns with integrating use of ICU (icu/LICENSE at main · unicode-org/icu · GitHub) into LLVM as either a static or shared library?

(I didn’t ask about iconv because that interface is part of POSIX and so I think we’re safe there. If folks think there’s a licensing question for iconv to be answered, feel free to chime in!)

In response to @tahonermann’s comments on D153418 (Adding iconv support to CharSetConverter class) about supporting ICU on z/OS:

We currently have no plans or resources allocated towards porting ICU to z/OS. Our users also rely on iconv for the system locales, but (and please correct me if I’m wrong) it seems like ICU does not use system locales, so this may not meet our users’ needs. So we would still prefer to have iconv support available at the very least for z/OS, even if ICU is the preferred default.

My summary of the response was:

  • An optional, dynamic link dependency to ICU is fine, but a static or required dynamic link dependency on ICU would impact LLVM licensing due to ICU’s attribution requirement and is not permissible.
  • Use of iconv() is fine if it’s available on the system but distribution of an iconv implementation would have other licensing ramifications.

Technical discussion was that a good path forward is for LLVM to provide a generic interface that can be implemented internally with any of ICU/iconv/MultiByteToWideChar/etc with selection(s) made at compile time. The interface will need to fail gracefully in the case that an implementation is not available at runtime.
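The shape of such an interface can be sketched as follows. This is only an illustration under stated assumptions: `LiteralConverter`, `IdentityConverter`, `createConverter`, and `encodeLiteral` are hypothetical names, and `HYPOTHETICAL_HAVE_ICU` / `HYPOTHETICAL_HAVE_ICONV` are made-up macros standing in for whatever the build system would define. The backend bodies are deliberately left as comments; the point is that an unavailable backend yields a null converter and a diagnostic rather than a crash.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical interface; it does not mirror any existing LLVM class.
class LiteralConverter {
public:
  virtual ~LiteralConverter() = default;
  // Returns false if the input cannot be converted.
  virtual bool convert(const std::string &In, std::string &Out) const = 0;
};

// Always-available fallback: UTF-8 to UTF-8 is the identity conversion.
class IdentityConverter final : public LiteralConverter {
public:
  bool convert(const std::string &In, std::string &Out) const override {
    Out = In;
    return true;
  }
};

// Backends are selected at build time through (made-up) macros.
std::unique_ptr<LiteralConverter> createConverter(const std::string &To) {
  if (To == "UTF-8" || To == "utf-8")
    return std::make_unique<IdentityConverter>();
#if defined(HYPOTHETICAL_HAVE_ICU)
  // An ICU-backed converter would be constructed here.
#elif defined(HYPOTHETICAL_HAVE_ICONV)
  // An iconv-backed converter would be constructed here.
#endif
  return nullptr; // No usable backend: the caller must handle this.
}

// Example caller: produces a diagnostic when the charset has no backend.
bool encodeLiteral(const std::string &Charset, const std::string &Text,
                   std::string &Out, std::string &Diag) {
  assert(!Charset.empty() && "charset name must not be empty");
  std::unique_ptr<LiteralConverter> Conv = createConverter(Charset);
  if (!Conv) {
    Diag = "unsupported charset: " + Charset;
    return false;
  }
  return Conv->convert(Text, Out);
}
```

Built without either macro, a request for anything other than UTF-8 fails with a diagnostic, which corresponds to the "encoding support is disabled" case discussed earlier in the thread.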

@abhina-sree, once -fexec-charset is enabled, how will conflicts in libraries be managed? For example, regular expression parsing would require a conception of the [, ], ^, |, and \ characters for syntax like [^0-9]|\\. These characters are not encoded in the same manner across all EBCDIC encodings.
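To make the concern concrete, the sketch below compares the byte values of those characters in two EBCDIC code pages. The values are taken from the commonly published IBM-037 and IBM-1047 tables and should be double-checked before being relied on; `VariantChar`, `kRegexChars`, and `countMismatches` are hypothetical names used only for illustration. Other EBCDIC code pages differ in further positions that are not shown here.

```cpp
#include <cstddef>
#include <cstdint>

// One row per character: its Unicode code point and its byte value in two
// EBCDIC code pages.
struct VariantChar {
  char32_t codePoint;
  std::uint8_t ibm037;
  std::uint8_t ibm1047;
};

constexpr VariantChar kRegexChars[] = {
    {U'[', 0xBA, 0xAD},
    {U']', 0xBB, 0xBD},
    {U'^', 0xB0, 0x5F},
    {U'|', 0x4F, 0x4F},
    {U'\\', 0xE0, 0xE0},
};

// Counts how many of the listed characters have different byte values in the
// two code pages.
constexpr std::size_t countMismatches() {
  std::size_t n = 0;
  for (const VariantChar &c : kRegexChars)
    if (c.ibm037 != c.ibm1047)
      ++n;
  return n;
}
```

A library compiled with one of these code pages as its literal encoding would therefore look for `[` at a different byte value than a caller compiled with the other.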

I’m not sure I understand your question. Are you afraid these libraries use integers to represent characters? Otherwise all string and character literals are going to be encoded in EBCDIC, so it should just work as long as they are representable.

It’s likely that some libraries which expect ‘A’ through ‘Z’, for example, to be a contiguous range, or which otherwise depend on the specific ordering of ASCII in some way, will not be compatible with some encodings, but that’s already the case with, e.g., GCC.

I suspect Hubert’s concern is that the situation can result in something similar to an ODR violation. For example:

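#include <regex>
#include <string>

inline std::string replace(std::string input) {
  // Character class containing the variant characters [, ], ^, |, and \.
  static std::regex re("[\\[\\]^|\\\\]");
  return std::regex_replace(input, re, "X");
}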

When included from TUs that use a different literal encoding, the characters used in the regular expression may be interpreted differently by the std::regex constructor depending on how symbols end up getting linked or loaded.

Yes, exactly. Although I am less concerned about user-provided code in the TUs and more concerned about the uses of these characters in literals within <regex> or <format>.

The libc++ shared library also contains strings with such characters from at least libcxx/src/regex.cpp.

The encoding is part of the ABI; they would need a different set of compiled libraries, including a differently compiled standard library, and maybe a way to detect incompatibilities. I’m not sure there are many headers that could be reused without recompiling.
But I don’t think it was ever a goal to be able to run programs on systems which have a different basic execution character set.
At the same time, I’m not sure we should warn on it, because cross-compiling should work.

Hubert, can you describe how the traditional xlC compiler handles these cases? Or was it less of an issue because problematic libraries like <regex> and <format> that rely on these characters were not provided?

The only solution I’m so far able to conceive of involves (benign) ODR violations and the library headers providing different interfaces based on the active literal encoding.

A TR1 implementation of <regex> was provided, but I am not sure how much weight to put into what it does. From inspection, the shared library portion was free of characters in literals with variant encodings. The header goes with the literal encoding in effect at the point it was included. There does not appear to be any symbol versioning based on encoding. This means that the variant characters will match in the natural manner as long as the user does not use <regex> with more than one literal encoding in the same program.

I think this does bring up a rather important point though.

Have we discussed whether we expect libc++, llvm-libc and the other various runtimes to officially support non-ASCII encodings?

The amount of work is probably not huge, but it might not be trivial either. I think at a minimum we would have to consider additional build bots.

@abhina-sree @ldionne @mordante

Thanks for the ping @cor3ntin.

A few things I can think of for libc++, which aren’t mentioned yet.

The <locale> parsers are in the dylib. Parts of the locale handling are done by the underlying C library. For example, we use strftime_l in libc++.

The <print> functions should be able to convert the input to Unicode for stdout, if it supports Unicode. When I implemented this, Clang always used UTF-8. We probably need to do something there too.

Another assumption libc++ makes is wchar_t('a') == L'a' is true. With this compiler flag are there cases where that assumption is no longer true?
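One way to picture the question is with simulated literal values. The sketch below assumes, purely for illustration, that wide literals stay UTF-32 while the ordinary literal encoding becomes IBM-1047; whether the flag would actually behave that way is not settled in this thread. All names are hypothetical.

```cpp
#include <cstdint>

// Simulated literal values; a real compiler would derive these from the
// ordinary and wide literal encodings in effect.
constexpr std::uint8_t kNarrowA_Ascii = 0x61;  // 'a' under ASCII/UTF-8
constexpr std::uint8_t kNarrowA_Ebcdic = 0x81; // 'a' under IBM-1047
constexpr char32_t kWideA_Utf32 = 0x61;        // L'a' if wide literals are UTF-32

// Models wchar_t(c): a plain integral conversion that performs no transcoding.
constexpr char32_t widenByCast(std::uint8_t c) {
  return static_cast<char32_t>(c);
}
```

Under that assumption, `widenByCast(kNarrowA_Ascii)` equals `kWideA_Utf32` but `widenByCast(kNarrowA_Ebcdic)` does not, so the comparison would stop holding for an EBCDIC ordinary literal encoding.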

There might be places that assume a-z is a contiguous range, which is true for ASCII but false for EBCDIC.
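A small sketch of why the contiguity assumption breaks, written against numeric EBCDIC byte values so it does not depend on the compiler's own literal encoding. The function names are hypothetical. In EBCDIC the lowercase letters occupy three runs (0x81-0x89, 0x91-0x99, 0xA2-0xA9), so a single range check from 'a' to 'z' also accepts bytes that are not letters.

```cpp
#include <cstdint>

// Naive check, equivalent to c >= 'a' && c <= 'z' when 'a' is 0x81 and 'z'
// is 0xA9 (their EBCDIC values).
constexpr bool isLowerNaive(std::uint8_t c) {
  return c >= 0x81 && c <= 0xA9;
}

// Check that respects the three separate runs a-i, j-r, and s-z.
constexpr bool isLowerEbcdic(std::uint8_t c) {
  return (c >= 0x81 && c <= 0x89) || (c >= 0x91 && c <= 0x99) ||
         (c >= 0xA2 && c <= 0xA9);
}

// Counts how many byte values each predicate accepts.
constexpr int countNaive() {
  int n = 0;
  for (int c = 0; c < 256; ++c)
    if (isLowerNaive(static_cast<std::uint8_t>(c)))
      ++n;
  return n;
}

constexpr int countCorrect() {
  int n = 0;
  for (int c = 0; c < 256; ++c)
    if (isLowerEbcdic(static_cast<std::uint8_t>(c)))
      ++n;
  return n;
}
```

The naive range accepts 41 byte values while only 26 of them are letters; byte 0xA1, for example, falls inside the range but sits in the gap between 'r' and 's'.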

In general, in libc++ we only want to support features we have a buildbot for. We already use some AIX buildbots provided by IBM. I don’t know what needs to be done to test the character sets @abhina-sree mentioned. For new buildbots we don’t object to adding an XFAIL. That at least gives an indication of what needs to be fixed to properly support this compiler option.

Would a case where the wide literal encoding is UTF-32 and the ordinary literal encoding is “Shift_JIS” count as a case where “that assumption” doesn’t hold true (for the half-width katakana characters)? Or was the question specifically about characters in the basic character set?