RFC: Enabling fexec-charset support in LLVM and Clang (Reposting)

I’m reposting the RFC for fexec-charset that I previously made a couple of years ago, with a few updates.

We wish to implement the fexec-charset option to enable support for execution character sets other than the default UTF-8. Our proposal is to use UTF-8 as the internal charset; all input source files will be converted to this charset. One side effect is that the file buffer size may change after translation between single- and multi-byte character sets. The Phabricator patch ⚙ D153419 Enable fexec-charset option shows our initial implementation plan. After this patch, we also wish to add a pragma convert, which will allow the user to control the execution charset for the code between a #pragma convert directive and the matching #pragma convert(pop).

Example:
#pragma convert("ISO8859-1")
char *text = "Hello World";
#pragma convert(pop)

First, we create converters using the CharSetConverter class (⚙ D153417 New CharSetConverter wrapper class for ConverterEBCDIC), which uses the underlying ConverterEBCDIC class for conversion between the internal charset and the system or execution charset, and vice versa. The CharSetConverter class provides some conversions directly; for unsupported conversions, it will attempt to create a converter using the iconv library if it exists (⚙ D153418 Adding iconv support to CharSetConverter class). Since the iconv library differs between platforms, we cannot guarantee that the behaviour will be the same across all platforms.
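
For reference, here is a minimal sketch of the kind of iconv fallback path described above, written against the standard iconv C API; the actual CharSetConverter interface in D153417/D153418 may well differ.

#include <iconv.h>
#include <string>
#include <vector>

// Convert Input from the internal charset (UTF-8) to ToEncoding. Returns
// false if iconv cannot provide this conversion on the current platform.
static bool convertFromUTF8(const std::string &Input, const char *ToEncoding,
                            std::string &Output) {
  iconv_t CD = iconv_open(ToEncoding, "UTF-8");
  if (CD == (iconv_t)-1)
    return false; // conversion not available here

  // The output may grow or shrink relative to the input (single- vs
  // multi-byte charsets), so size the buffer generously.
  std::vector<char> Buf(Input.size() * 4 + 16);
  char *In = const_cast<char *>(Input.data());
  size_t InLeft = Input.size();
  char *Out = Buf.data();
  size_t OutLeft = Buf.size();
  bool OK = iconv(CD, &In, &InLeft, &Out, &OutLeft) != (size_t)-1;
  iconv_close(CD);
  if (OK)
    Output.assign(Buf.data(), Buf.size() - OutLeft);
  return OK;
}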

Then, during the parsing stage, we translate the string literals using this converter. Translation cannot be performed after this stage because the context is lost and escaped characters can no longer be distinguished from ordinary characters. The translated string is what appears in the readable IR output.

In addition, we wish to control translation for different types of strings. In our plan, we introduce an enum, ConversionAction, to indicate which output codepage is needed. The table below summarizes which codepage different kinds of string literals should be in; a rough sketch of such an enum follows the table:

╔══════════════════════════════════════╤═════════════════════════════════╗
║Context                               │Output Codepage                  ║
╠══════════════════════════════════════╪═════════════════════════════════╣
║asm("...")                            │system charset (IBM-1047 on z/OS)║
╟──────────────────────────────────────┼─────────────────────────────────╢
║typeinfo name                         │system charset (IBM-1047 on z/OS)║
╟──────────────────────────────────────┼─────────────────────────────────╢
║#include "fn" or #include <fn>        │system charset (IBM-1047 on z/OS)║
╟──────────────────────────────────────┼─────────────────────────────────╢
║__FILE__ , __func__                   │-fexec-charset                   ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║literal, string & char                │-fexec-charset                   ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║user-defined literal                  │-fexec-charset                   ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║extern "C" ...                        │n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║_Pragma("message(...)”)               │n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║attribute args (GNU, GCC)             │n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║module map string literals (-fmodules)│n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║line directive, digit directive       │n/a (internal charset, UTF-8)    ║
╚══════════════════════════════════════╧═════════════════════════════════╝
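
For concreteness, a hedged sketch of what such an enum could look like; the enumerator names and granularity are illustrative and may differ from the actual patch:

// Hypothetical sketch only.
enum class ConversionAction {
  NoConversion,    // keep the internal charset (UTF-8), e.g. attribute args
  ToSystemCharset, // e.g. asm("..."), typeinfo names, #include file names
  ToExecCharset    // e.g. string/char literals, __FILE__, __func__
};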

Several complexities arise when string literals are inspected after translation. In these later stages of compilation, there is an underlying assumption that the literals are encoded in UTF-8, when in fact they can be encoded in one of many charsets. Listed below are four instances where this assumption can be found; a byte-level illustration follows the list.

  1. During printf/scanf format string validation, the format specifiers are searched for as UTF-8 bytes but are not found, because the string has already been translated.
  2. Certain optimizations after the parsing stage (e.g. in SimplifyLibCalls.cpp, printf("done\n") gets optimized to puts("done") if the string ends in '\n') will not work if the string is already translated.
  3. When generating messages (e.g. pragma message, warnings, errors) the message (usually in UTF-8) may be merged with string literals which may be translated, resulting in a string with mixed encoding.
  4. When using VerifyDiagnosticConsumer to verify expected diagnostics, the expected text is in UTF-8 but the message can refer to string constants and literals which are translated and cannot match the expected text.
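
As a byte-level illustration of items 1 and 2 (hedged: the byte values below assume IBM-1047 as the execution charset and the usual z/OS mapping of '\n' to 0x15):

// Illustrative only; exact byte values depend on the chosen execution charset.
const char *Fmt = "%d\n";
// In UTF-8/ASCII the bytes are 0x25 0x64 0x0A. After translation to IBM-1047
// they become roughly 0x6C 0x84 0x15, so a format-string check that scans for
// '%' (0x25), or the printf -> puts rewrite that looks for a trailing '\n'
// (0x0A), will no longer find them in the translated bytes.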

To resolve these complexities, we currently see no option other than reversing the translation, disabling such optimizations, or suppressing translation in the places where the string is assumed to be encoded in UTF-8. Although reversing the translation may not yield the original string, it can be used to locate format specifiers, which are guaranteed to be correctly identified.

Any feedback on this implementation is welcome.

Thanks,
Abhina

Approach generally looks reasonable.

For analyzing strings passed to libc functions like printf/scanf/etc, the analysis needs to know what encoding libc is using, yes, and adjust appropriately.

Do we need to encode the execution charset into the IR somehow? SimplifyLibCalls is an IR optimization, not part of the frontend, so it can’t see the setting in LangOptions. Maybe a function attribute would be appropriate.

I’m confused what you expect VerifyDiagnosticConsumer to do; diagnostics that refer to string constants/literals should translate the constants/literals to UTF-8, so tests should just work. (If there are specific tests that somehow implicitly assume the exec-charset is UTF-8, we can fix those tests to explicitly pass “-fexec-charset=utf-8”, but that should only apply to tests that are specifically trying to test Unicode support.)

+1, looks like we should. For where to store it, consider the following example:

// Compiled with -fexec-charset=utf-8
void my_printf(const char *str) { printf(str); }
// Compiled with -fexec-charset=exotic
extern void my_printf(const char *str);

void test() { my_printf("Hello\n"); }

After LTO and inlining this will become:

void test() { printf("Hello\n"); }

What should SimplifyLibCalls do here? The string literal comes from the module created with -fexec-charset=exotic, but printf was originally called from the module created with -fexec-charset=utf-8.
I guess it would be easier to store the encoding in module metadata and prohibit linking modules created with different -fexec-charset options.
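
A hedged sketch of what that could look like on the clang side, using LLVM's existing module-flag machinery; the flag name "exec-charset" is illustrative, not an existing flag, and Error behavior makes the IR linker reject modules whose values disagree, i.e. it prohibits mixing -fexec-charset settings:

#include "llvm/ADT/StringRef.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"

// Hypothetical helper; the flag name is illustrative only.
void recordExecCharset(llvm::Module &M, llvm::StringRef Charset) {
  // With Module::Error, llvm-link/LTO will refuse to merge modules that
  // carry different values for this flag.
  M.addModuleFlag(llvm::Module::Error, "exec-charset",
                  llvm::MDString::get(M.getContext(), Charset));
}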

Yes, I think that’s a good idea. With the future pragma convert feature, it will be possible to have multiple execution charsets in the same file, so a module flag or function attribute may not be fine-grained enough. But for the current fexec-charset patch I have, a module flag will be enough to encode the execution charset in the IR.

Yes, I think I wrote this point with the consideration that in the future, there may be testcases that implicitly assume UTF-8, but as you mentioned, those can be fixed.

Right, this will definitely be an issue. We can either give a warning/error about mismatched encodings or leave it to the user’s discretion.

This only works on some POSIX-like systems, and clang strives to work on Windows (and OSX, but I can’t imagine this feature being useful there). This needs to be addressed, or at least acknowledged, and if there is community consensus to go in that direction, that’s fine.

asm("...")

This isn’t clear to me; it should be in whatever encoding the linker expects, shouldn’t it?
I.e., what will lld do there?

typeinfo name

Shouldn’t that be the execution charset?

#include "fn" or #include <fn>

Here be dragons. I.e., by converting to UTF-8 before lexing, we might alter the byte representation when converting back, such that we can’t open the files.
I don’t believe there is any way to sanely handle this hypothetical. Let’s pretend this is not an actual issue;
people really should not be creative with their file names.

user-defined literal

I believe these should be kept in UTF-8.

attribute args (GNU, GCC)

This is not actually entirely clear to me. Some attributes do expect a string literal, in which case it’s UTF-8.
But some expect an evaluated string; in that case it will be in the execution encoding (i.e. -fexec-charset).

To resolve these complexities, we currently see no option other than reversing the translation, disabling such optimizations, or suppressing translation in the places where the string is assumed to be encoded in UTF-8. Although reversing the translation may not yield the original string, it can be used to locate format specifiers, which are guaranteed to be correctly identified.

We could keep around the original (UTF-8) string, but all the encodings C++ supports round-trip through Unicode in a semantics-preserving (but not necessarily byte-preserving) way, so all the use cases we want to support should work.

There might be issues with the fact that we don’t have normalization facilities in clang yet, so we can imagine scenarios in which round-tripping through a different encoding would affect identifiers, but there is currently no scenario in which this would be an issue.

In the future, we will need to convert execution-encoding strings into UTF-8, for example in the context of reflection in C++ (should that materialize), or as part of https://isocpp.org/files/papers/P2741R3.pdf (an accepted C++26 feature).

Note ⚙ D105759 Implement P2361 Unevaluated string literals, which intends to clarify some of these encoding questions (sadly, not all of them).

Globally, I agree with the general architecture; however, we really do need to resolve the iconv-related questions before progressing this.

Would defaulting to ICU and falling back to iconv on platforms where ICU is not available (mostly z/OS) be something the community would be fine with?

The second column is a bit confusing. “Charset” and “encoding” are two orthogonal things (and there is no “system charset”). The other thing that confuses me is the mention of the IBM-1047 encoding in the first three rows. Do you plan to support non-UTF-8 source files in clang and/or add another option controlling the assembly output encoding?

Note that '\x' in a string literal can create invalid code points, making the translation irreversible. I guess we will have to keep the original (UTF-8) string for diagnostics / AST / other uses in clang.
Alternatively, we could invent a data structure that could precisely represent source string literals, with all '\u's and '\x'es untouched and literals unconcatenated (source-to-source translations could benefit from this…). A rough sketch of what such a structure could look like follows.
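
A minimal, purely illustrative sketch; none of these names exist in clang today:

#include <string>
#include <vector>

// Hypothetical data structure only; names and layout are illustrative.
struct LiteralPiece {
  enum Kind { Text, NumericEscape, NamedEscape } PieceKind;
  std::string Spelling; // the original source spelling, e.g. "abc" or "\x80"
};

struct SourceStringLiteral {
  // One entry per concatenated token, kept unconcatenated, with all escape
  // sequences left untouched so the original source can be reproduced.
  std::vector<std::vector<LiteralPiece>> Tokens;
};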

Good point, thanks for reminding me of that. But does it change anything? '\x' escapes are replaced after transcoding, so UTF-8 strings may also be invalid already; I don’t think this changes anything.

To correctly implement the standards, we need to first concatenate, then encode, then replace numeric escape sequences; but those sequences must not be allowed to cross concatenation boundaries, so it’s a bit tricky.

I’m not sure I understand. If the source file is UTF-8, but not a valid one, it must be rejected in translation phase 1. Or what did you mean by “utf-8 strings”?

Is there an issue in doing concatenation and translation at the same time? (This is what we’ve been doing so far.)

"\x80", for example, can appear in a source file and is valid, even at translation phase 5, regardless of the source encoding, when the execution character set is UTF-8 (yet it is clearly not valid UTF-8). That does not prevent us from checking format string validity, emitting diagnostics, etc.

Consider "e" "\N{ACUTE ACCENT}" - lets say the execution encoding is latin 1. ACUTE ACCENT is not representable in latin 1, but the sequence e, ACUTE ACCENT is. So the order of operation matters.

But I’ve been thinking about that overnight. It would add a bit of complexity for something that is only theoretical, and something that neither GCC nor iconv supports (I’ll have to check what ICU does). So… maybe we can ignore it!

Thanks, now I get it.
Suppose we have something like printf("\x80" "%llx\n", (int)var).
In order to properly analyze the format string we need to simulate the behavior of the real printf. It is easier to do so if the format string has been translated, but in order to emit a good diagnostic we need to locate the '%', which is only possible if we have the original literal(s) at hand. This is becoming more complicated than I thought…

Is that right? If \N{COMBINING ACUTE ACCENT} is not a member of the execution charset, it should be replaced with a “replacement character”, e.g. '?'. Whether or not the concatenation has taken place seems irrelevant. Unless I misunderstand it, that is.

The existing code to handle this is in StringLiteral::getLocationOfByte; it re-parses the string literal to figure out the correct source location. Adapting it to deal with -fexec-charset correctly shouldn’t be that bad; I mean, actually counting bytes correctly is a bit tricky, but all the necessary information from the source code is easily accessible.

We don’t do NFC normalization normally; doing it when -fexec-charset is enabled would be confusing. If the user wants a latin-1 “e-with-acute-accent”, they should use the corresponding Unicode “e-with-acute-accent” character (U+00E9). (Unicode defines such a combined character for every character that exists in standardized non-Unicode encodings.)


I thought that maybe ICU would replace grapheme clusters when transcoding, but that’s not the case. So given that nobody else does that, it’s not a problem we have to care about, and transcoding before concatenation may be fine after all, which certainly makes things easier!

By “system charset”, I was referring to the default encoding on the platform. For z/OS, the default is IBM-1047. Should it be “system encoding” instead?
And yes, I think it was also mentioned in this RFC, [RFC] Adding a CharSet Converter to the LLVM Support Library, that we wish to support non-UTF-8 source files as well.

I agree, this is something that is important to resolve. iconv is also POSIX-compliant and GCC’s fexec-charset uses it, so if Windows is the major platform that requires ICU, could we instead default to iconv and fall back to ICU on platforms like Windows?

This came up on the other RFC on this topic and I don’t think we came to a resolution there. My questions boil down to:

  • It seemed from the other discussions that there were concerns with using iconv; should we use ICU as the default, fall back to iconv only when ICU is not available, and gracefully fail if neither is available?
  • Do iconv and ICU produce the same results for the same inputs? If not, what is our plan to address the differences?
  • ICU is not a system library on all versions of Windows we support, though it is supported on Windows 10, so it adds an extra dependency.
    • What is the cost of that dependency? Will it be statically linked or dynamically linked? If static, what does it do to binary sizes?
    • Are there downstream impacts for the dependency that we need to be concerned about?
    • What licensing concerns (if any) exist?
    • Is the intent to use the ICU from Windows 10 when available and fallback to an external ICU? Will we switch to the Windows 10 ICU when we drop support for older versions of Windows? What differences (if any) exist between Windows ICU and the third-party release of ICU?
  • If we need a different implementation for Windows from other platforms, are the Win32 APIs sufficient for the task so that we don’t need any extra dependencies on Windows?

This sounds like a good plan, although is there strong enough demand for this feature to justify doubling the maintenance and packaging work? We have the internal EBCDIC converter, which should cover z/OS,
and ICU should be able to accommodate most POSIX systems and most Windows ones.

ICU produces self-consistent results everywhere (it is the same library everywhere, and encodings are stable; they don’t really change over time), while iconv may produce different results on different platforms. This was mostly a concern for EBCDIC line endings; for other scenarios it’s not something that I’m concerned about. Few encodings have duplicates, and when they do, they represent the same abstract/semantic characters. Some different encodings share names across platforms, but there is usually at least one name that is unique enough, if that were a concern.

Let’s see how things work on Windows 10 first. Adding a second or third implementation (presumably one based on MultiByteToWideChar) is probably not worth the trouble, because we would most likely have to do quite a lot of manual work; in particular, Windows does not provide an easy way to map encoding names to code page enums.

But generally I agree with you. We should answer these questions before pursuing any pull request.
I don’t think it’s a huge amount of work.
Researching how we can portably link to ICU, maybe talking to a few package maintainers, and seeing how it behaves on Windows seem to be good next steps.

Speaking as a packager of LLVM’s Windows binaries and Chromium’s toolchain, any new dependency is typically a fairly large pain. Sometimes they’re worth it, and I think that’s the most important question here.

Another important question is what impact this would have on Clang’s performance.


For z/OS, we wish to support more than just EBCDIC but don’t have ICU support, which is why we also want to add the iconv support. But in general, I’m happy with using ICU as the default when it’s available.

Will icu4c be a hard dependency? That is, will I be unable to build Clang without icu4c? Apple ships the dylib but no headers; I have to install it via brew.


The smaller we keep the maintenance burden, the better the feature looks.

Excellent, thank you!

Fewer implementations are better. :slight_smile: I was thinking of MultiByteToWideChar combined with IMultiLanguage::GetCharsetInfo.