Implementing charsets (-fexec-charset & -finput-charset)

Hi, I have been investigating how to implement -finput-charset and -fexec-charset (including -fwide-exec-charset). I noticed some discussions from a couple of years ago and was planning on taking the same approach. In a nutshell: change the source manager to convert the input charset into UTF-8 and do the parsing in UTF-8 (e.g., in Lexer::InitLexer()), then convert string and character constants into the exec charset when a consumer asks for the string literal. This seems like a sound concept, but there are many details that need to be ironed out. The clang data structures are filled with all kinds of strings (file names, identifiers, literals). What charset should be used when creating the clang ASTs? Should getName() return the name in UTF-8 or in an output charset?
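For concreteness, here is the kind of input-side conversion step I have in mind, as a standalone sketch. None of this is existing Clang/LLVM API: the transcode() name is made up, and POSIX iconv is just one possible back end.

```cpp
// Sketch: transcode a whole buffer between charsets, e.g.
// transcode(Buf, "IBM-1047", "UTF-8") before the lexer ever sees it.
#include <iconv.h>
#include <cerrno>
#include <optional>
#include <string>

std::optional<std::string> transcode(const std::string &In,
                                     const char *From, const char *To) {
  iconv_t CD = iconv_open(To, From);
  if (CD == (iconv_t)-1)
    return std::nullopt; // no converter for this charset pair
  std::string Out;
  char Buf[4096];
  char *InPtr = const_cast<char *>(In.data());
  size_t InLeft = In.size();
  while (InLeft > 0) {
    char *OutPtr = Buf;
    size_t OutLeft = sizeof(Buf);
    if (iconv(CD, &InPtr, &InLeft, &OutPtr, &OutLeft) == (size_t)-1 &&
        errno != E2BIG) { // E2BIG just means "output buffer full; loop again"
      iconv_close(CD);
      return std::nullopt; // ill-formed or unmappable byte sequence
    }
    Out.append(Buf, OutPtr - Buf);
  }
  iconv_close(CD);
  return Out;
}
```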

While looking into this I realized that we need one more charset. We have the input charset for the source code and the exec charset for the string literals, and we have an internal charset (UTF-8) for parsing. But we also need a charset for things like symbol names and file names. These do not use the input or internal charsets. For example, on MVS the user may say the input charset is ASCII or UTF-8, but the actual file names and the symbol names in the object file need to be EBCDIC. The same would be true for alternative charsets on Linux: a code point in a locale other than en_US may map to a different code point in the file system. This other charset is the output charset. It is the charset that symbol names in the object file should use, as well as the charset for file names.

We also need to consider messages. They may not be in the same charset as the input or internal charsets. We will need to consider translation for both the message text and the substituted text.

I have thought about some of these issues and would like feedback and/or suggestions.

  1. Source file names
  • We’d store these in the SourceManager in the output charset.
  • When preprocessing (#include, etc.) we would convert the file names into the output charset and do all file-name building and system calls using the output charset.
  2. Identifiers
  • I think the getName() function on IdentifierInfo and similar functions should return the name in the output charset. Too many places, even in clang, use the getName() functions and would need to apply a translation if we didn’t do this.
  • We need some way to keep parsing quick, since identifiers will start off in UTF-8 and we won’t be able to use getName() to look up identifiers any more. I was thinking about adding a getNameInternal() that would return the UTF-8 spelling and would be used in the hashing (see the sketch after this list).
  3. String literals & character constants
  • These are converted to the exec charset and stored in the clang structures in the translated form.
  4. Messages & trace information
  • I’m going to assume the charset for messages is a variation of the output charset.
  • All substitution text should be in, or converted into, the output charset before generating the diagnostic message.
  • Trace/dump output will be in the output charset too.
  5. Preprocessed output (including the make dependency rules)
  • All preprocessed output will be in the output charset.
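To illustrate item 2, here is a rough sketch of the two accessors. Hypothetical names throughout; none of this is existing Clang API, and the identity converter is only there to keep the sketch self-contained.

```cpp
#include <string>

// Stand-in for the real converter (it would call into something like the
// iconv-based transcode() sketch above); identity here for self-containment.
static std::string toOutputCharset(const std::string &UTF8) { return UTF8; }

struct IdentInfoSketch {
  std::string UTF8Name; // the spelling as lexed; always UTF-8

  // Fast path: used for hashing and identifier-table lookup while parsing,
  // so no conversion happens on the hot path.
  const std::string &getNameInternal() const { return UTF8Name; }

  // What diagnostics, dumps, and external AST consumers would see.
  std::string getName() const { return toOutputCharset(UTF8Name); }
};
```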

Thanks

And I thought Windows was complicated…

Debug info (at least for DWARF) wants strings to be UTF-8; filenames, type/variable names, all that stuff. Just to throw that out there.

–paulr

UTF-8; introducing another charset into the AST seems confusing for no benefit, given we’re converting the input source code to UTF-8 anyway. The only place we need to translate symbol names to the execution charset is IR generation. The charset for symbol names has to be the same as -fexec-charset for any target that has an API like dlsym().

File names do use a different charset, but LLVM has a file system layer which abstracts that, so clang should pretend the filesystem is UTF-8. (On Windows, we convert to UTF-16 before we call into the OS. On other systems, we currently just assume everything is UTF-8, but we could change that if you need to run the compiler on a system where that doesn’t hold.)

Messages should be UTF-8 until we have to convert them for output. (An IDE always wants UTF-8. For a console/stderr, we probably need some conversion, but IIRC that isn’t implemented at the moment.)

Preprocessed output needs to be in the input charset; otherwise the compiler can’t consume the result (for example, -save-temps would break).

-Eli

UTF-8; introducing another charset into the AST seems confusing for no benefit, given we’re converting the input source code to UTF-8 anyway. The only place we need to translate symbol names to the execution charset is IR generation.

+1

While looking into this I realized that we need one more charset. We have the input charset for the source code and the exec charset for the string literals, and we have an internal charset (UTF-8) for parsing. But we also need a charset for things like symbol names and file names.

The charset for symbol names has to be the same as -fexec-charset for any target that has an API like dlsym().

File names do use a different charset, but LLVM has a file system layer which abstracts that, so clang should pretend the filesystem is UTF-8. (On Windows, we convert to UTF-16 before we call into the OS. On other systems, we currently just assume everything is UTF-8, but we could change that if you need to run the compiler on a system where that doesn’t hold.)

+2

We also need to consider messages. They may not be in the same charset as the input or internal charsets. We will need to consider translation for both the message text and the substituted text.

Messages should be UTF-8 until we have to convert them for output. (An IDE always wants UTF-8. For a console/stderr, we probably need some conversion, but IIRC that isn’t implemented at the moment.)

Preprocessed output needs to be in the input charset; otherwise the compiler can’t consume the result (for example, -save-temps would break).

+3 :slight_smile:

-Chris

Any translation of file names is risky due to round-trip issues and
well-formedness requirements. For example, Shift-JIS defines many code
points that don't round-trip through Unicode [1]. And most POSIX
systems don't require file names to adhere to any particular encoding,
leaving proper interpretation dependent on locale settings.

I don't know of particularly good solutions for these issues. Things to
think about:
- Should file names in #include directives be transcoded with the source
   file? (I think so, though this will break attempts to compile, for
   example, EBCDIC source files on systems where referenced file names
   are stored as UTF-8; I'm not sure how gcc handles this).
- How should file names on command lines be interpreted? (I think
   either no translation, or according to the current locale; on Windows,
   the wide command line should be processed).
- How should file names in environment variables be interpreted? (I
   think either no translation, or according to the current locale; on
   Windows, the wide environment variable values should be processed).

I think it is reasonable not to support all file names, but if so, the
limitations should be documented. For starters, how file names are
interpreted in various contexts (command line, env vars, #include,
config files, etc.) should be documented. So should limitations such as
no support for code points that don't round-trip through Unicode, or no
support for file names that are not well-formed for the encoding they
are interpreted with.
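For illustration, a round-trip check along these lines could be applied
before accepting a file name. It reuses the hypothetical iconv-based
transcode() helper sketched earlier in the thread:

```cpp
#include <optional>
#include <string>

// From the earlier iconv sketch (not an existing LLVM API).
std::optional<std::string> transcode(const std::string &In,
                                     const char *From, const char *To);

// True iff Name survives Charset -> UTF-8 -> Charset unchanged.
static bool roundTrips(const std::string &Name, const char *Charset) {
  std::optional<std::string> UTF8 = transcode(Name, Charset, "UTF-8");
  if (!UTF8)
    return false; // ill-formed for Charset, or unmappable to Unicode
  std::optional<std::string> Back = transcode(*UTF8, "UTF-8", Charset);
  // E.g. some Shift-JIS code points come back as different bytes.
  return Back && *Back == Name;
}
```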

Tom.

[1]:
https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Outside of Windows and macOS, filenames on most filesystems can be any arbitrary sequence of bytes followed by NUL — there could be one encoding, many encodings, or no valid encoding for the bytes in the filenames in a given filesystem.

I’m supportive of requiring these filenames to be encoded in UTF-8, but I believe Clang currently lets non-UTF-8 filenames through.

As part of this work, I’d advocate making it an error to try to compile (or preprocess, or link, etc.) any file whose name is not encoded in UTF-8.
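A standalone sketch of such a check follows. (LLVM’s llvm/Support/ConvertUTF.h has helpers that could serve instead; this plain-C++ version is just to show the shape of it.)

```cpp
#include <cstdint>
#include <string>

// True iff S is well-formed UTF-8: no stray continuation bytes, no
// truncated or overlong sequences, no surrogates, nothing past U+10FFFF.
bool isWellFormedUTF8(const std::string &S) {
  size_t i = 0, n = S.size();
  while (i < n) {
    uint8_t b = S[i];
    size_t len;
    uint32_t cp;
    if (b < 0x80)                { len = 1; cp = b; }
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
    else return false;             // invalid lead byte
    if (i + len > n) return false; // truncated sequence
    for (size_t j = 1; j < len; ++j) {
      uint8_t c = S[i + j];
      if ((c & 0xC0) != 0x80) return false; // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
        (len == 4 && cp < 0x10000)) return false; // overlong encoding
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    i += len;
  }
  return true;
}
```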

Ben

clang and llvm aren’t performing any conversion right now. Everything assumes the input, output and exec charsets are UTF-8. One user scenario I am trying to enable is the input charset being EBCDIC on a system where EBCDIC is the charset. Doing this is non-trivial and exposes the issues I outlined below, and most likely more (e.g., debug info).

The clang source is filled with code that is equivalent to "ch == 'A'" or "ch >= 'A' && ch <= 'Z'", where ch is a character from a source file. These comparisons make many assumptions. Some of these are:

  • If the code is compiled on an EBCDIC system, the character literals 'A' and 'Z' will be in EBCDIC.
  • The source files are also in EBCDIC.
  • The code points from 'A' to 'Z' are contiguous. That is not the case in EBCDIC.

We can solve many of these problems by converting the source files from EBCDIC into UTF-8 and changing the comparisons above to "ch == u'A'" and "ch >= u'A' && ch <= u'Z'". Hence the internal charset is now really UTF-8.
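A small, self-contained illustration of the contiguity problem and the UTF-8 fix (the EBCDIC code points cited are from IBM-1047):

```cpp
#include <cassert>

// In IBM-1047 EBCDIC, 'A'..'I' are 0xC1..0xC9, 'J'..'R' are 0xD1..0xD9,
// and 'S'..'Z' are 0xE2..0xE9, with non-letter code points in the gaps.
// So on an EBCDIC host, "ch >= 'A' && ch <= 'Z'" matches bytes that are
// not letters at all. Once the lexer works on UTF-8, the range is safe:
bool isUpperASCIIRange(char16_t Ch) {
  return Ch >= u'A' && Ch <= u'Z'; // U+0041..U+005A is contiguous in Unicode
}

int main() {
  assert(isUpperASCIIRange(u'Q') && !isUpperASCIIRange(u'q'));
  // On an EBCDIC host, 0xCA sits between 'I' (0xC9) and 'J' (0xD1) but is
  // not a letter; the naive char-based range test would have accepted it.
}
```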

With no other changes, the getName() family of functions in the clang AST will return identifier names in UTF-8. This doesn’t look very pretty or helpful in error messages or the internal dump functions. When the input charset is not from the same family as UTF-8 (i.e., ASCII-based), you need to perform a lot of conversions to get useful output from the compiler. I’m proposing to have the getName() functions return the declaration name in the output charset so we get intelligible output.

The same issue happens with include file names. The llvm support library provides a layer of encapsulation around files, but it does not translate the file names from UTF-8 to the system charset (e.g., EBCDIC IBM-1047); it just uses the file names as they are. In addition, the clang code in the preprocessor adds the “./” search path if needed. The code points for these characters have to be chosen carefully so we don’t mix charsets when constructing file names. We need to convert the internal UTF-8 to the output charset (e.g., EBCDIC) before trying to open files or displaying them in error messages. The easiest solution is to store the file names in the source manager in the output charset. That avoids the need to translate the names every time they are used.
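As a sketch of that design (hypothetical types; transcode() is the illustrative iconv helper from earlier in the thread):

```cpp
#include <cstdio>
#include <optional>
#include <string>

// From the earlier iconv sketch (not an existing LLVM API).
std::optional<std::string> transcode(const std::string &In,
                                     const char *From, const char *To);

// Hypothetical: a file name is converted once, when it enters the source
// manager, so later opens and diagnostics use the stored bytes as-is.
struct StoredFileName {
  std::string Native; // already in the output charset (e.g. IBM-1047)
};

StoredFileName internFileName(const std::string &UTF8Name,
                              const char *OutputCharset) {
  // Build the full name (e.g. prepending the "./" search path) while still
  // in UTF-8, then convert the whole thing, so no charsets get mixed.
  std::string Full = "./" + UTF8Name;
  std::optional<std::string> Native = transcode(Full, "UTF-8", OutputCharset);
  return StoredFileName{Native ? *Native : Full}; // fall back to UTF-8 bytes
}

FILE *openStored(const StoredFileName &FN) {
  return std::fopen(FN.Native.c_str(), "rb"); // no per-open conversion needed
}
```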

On an EBCDIC system, the messages will be in EBCDIC, not UTF-8. All output is expected to be in EBCDIC. We can’t assume UTF-8 is used everywhere.

I had thought that the IR generator was the primary spot where we would have to do translation. After prototyping and looking at clang, I found that most of the places that need translation are in clang itself. Clang generates things like messages, dumps, and secondary output files (e.g., make depend), and these all need to translate symbol and file names before generating the output. As well, there are all of the tools that use the clang AST; if the clang AST stored names in UTF-8, these would need to translate them too. I think the easiest design is to store the symbol names and the file names in the correct charset and not force the consumer to do a conversion.


Please don’t mix together the issues of compiling for an EBCDIC target and running LLVM on an EBCDIC host. I understand it’s kind of tied together from your perspective (since the end result you want is a native compiler which runs on an EBCDIC target), but LLVM is always built as a cross-compiler, so we need to consider them separately to get a reasonable result.

If you’re cross-compiling UTF-8-encoded source code on a UTF-8 host to an EBCDIC target, you need conversions in a few places in clang: specifically, symbol names need to be translated when IR is generated, and string/character literals need to be translated by the lexer. And the LLVM backend might also need to convert certain strings which are emitted into object files.

If you’re cross-compiling EBCDIC-encoded source code on a UTF-8 host to a UTF-8 target, you need a conversion in exactly one place; the input source code needs to be converted to UTF-8, once.

If you’re cross-compiling EBCDIC-encoded source code on a UTF-8 host to an EBCDIC target, you need both of the above conversions.

If you’re compiling LLVM/clang for an EBCDIC host, everything becomes complicated because both LLVM and clang assume they’re running in an ASCII-compatible locale; the issues you’re describing are primarily related to this. You probably want to leave this for last, because a lot of the changes involved will be controversial, and it’ll be easier to convince everyone it’s useful if you have a usable target.
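In code form, the cross-compilation scenarios above reduce to two independent conversions. A sketch, with hypothetical names, just to make the matrix concrete:

```cpp
#include <string>

struct CharsetPlan {
  bool ConvertSourceToUTF8;    // needed when -finput-charset != UTF-8
  bool ConvertLiteralsAndSyms; // needed when -fexec-charset != UTF-8
};

CharsetPlan planFor(const std::string &InputCharset,
                    const std::string &ExecCharset) {
  // EBCDIC source, UTF-8 target: input conversion only.
  // UTF-8 source, EBCDIC target: literal/symbol conversion only.
  // EBCDIC source, EBCDIC target: both. EBCDIC-host issues are separate.
  return CharsetPlan{InputCharset != "UTF-8", ExecCharset != "UTF-8"};
}
```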