This only works on some POSIX-like systems, and clang strives to work on Windows (and OSX, but I can’t imagine this feature being useful there). This needs to be addressed or at least acknowledged, and if there is community consensus to go in that direction, that’s fine.
asm(“…”)
This isn’t clear to me. It should be in whatever encoding the linker expects, shouldn’t it?
I.e., what will lld do there?
typeinfo name
Shouldn’t that be the execution charset?
#include "fn" or #include <fn>
Here be dragons. I.e., by converting to UTF-8 before lexing, we might alter the byte representation when converting back, such that we can’t open the files.
I don’t believe there is any way to sanely handle this hypothetical. Let’s pretend this is not an actual issue;
people really should not be creative with their file names.
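To make the file name concern concrete, here is a small Python sketch (not clang code). It assumes CP932 as a hypothetical source encoding, in which the byte sequences 0x81E0 and 0x8790 both decode to U+2252. Since an encoder can only pick one of them, at least one spelling of a header name cannot survive the trip through Unicode. The helper name `survives_round_trip` and the two header names are made up for illustration.

```python
def survives_round_trip(raw_name: bytes, encoding: str) -> bool:
    """Return True if decoding and then re-encoding yields the same bytes."""
    return raw_name.decode(encoding).encode(encoding) == raw_name


# Two hypothetical on-disk header names that differ only in how they
# spell U+2252 in CP932 (JIS X 0208 code vs. NEC extension code).
name_a = b"\x81\xe0.h"
name_b = b"\x87\x90.h"

# Both byte strings decode to the same text...
same_text = name_a.decode("cp932") == name_b.decode("cp932")
print("same text after decoding:", same_text)

# ...so at most one of them can come back unchanged after re-encoding.
for name in (name_a, name_b):
    print(name, "survives:", survives_round_trip(name, "cp932"))
```

If both files existed on disk, a compiler that only kept the decoded name would have no way to tell which one `#include` meant, which is why treating this as a non-issue seems like the only practical option.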
user-defined literal
I believe these should be kept in UTF-8.
attribute args (GNU, GCC)
This is not actually entirely clear to me. Some attributes expect a string literal, in which case it’s UTF-8.
But some expect an evaluated string; in that case it will be in the execution encoding (i.e. the one selected by -fexec-charset).
Currently, we see no other way than to reverse the translation, disable this optimization or stop certain translations when the string is assumed to be encoded in UTF-8 to resolve these complexities. Although reversing translation may not yield the original string, it can be used to locate format specifiers which are guaranteed to be correctly identified.
We could keep the original (UTF-8) string around, but all the encodings C++ supports round-trip through Unicode in a semantics-preserving (though not necessarily byte-preserving) way, so all the use cases we want to support should work.
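As a rough illustration of the reverse-translation idea (a Python sketch, not the proposed implementation): it uses Python's `cp037` EBCDIC codec as a stand-in for the execution encoding, and `find_format_specifiers` is a hypothetical helper. The specifier pattern is deliberately simplified and does not cover every printf feature.

```python
import re

# Simplified printf-style specifier pattern. "%%" is matched too so that
# it can be recognized and skipped rather than misread as a specifier.
SPEC = re.compile(r"%(?:%|[-+ #0]*\d*(?:\.\d+)?(?:hh|h|ll|l|j|z|t|L)?[a-zA-Z])")


def find_format_specifiers(exec_bytes: bytes, exec_encoding: str):
    """Decode execution-charset bytes back to text and locate specifiers.

    Returns (character offset, specifier text) pairs.
    """
    text = exec_bytes.decode(exec_encoding)  # the "reverse translation"
    return [(m.start(), m.group()) for m in SPEC.finditer(text)
            if m.group() != "%%"]


literal = "value=%d name=%s"
translated = literal.encode("cp037")  # stand-in for the translated literal

specs = find_format_specifiers(translated, "cp037")
print(specs)
```

Scanning the translated bytes for an ASCII `%` would not work here, because the EBCDIC byte for `%` differs from the ASCII one; going back through Unicode first sidesteps that.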
There might be issues stemming from the fact that we don’t have normalization facilities in clang yet: we can imagine scenarios in which round-tripping through a different encoding would affect identifiers. However, there is currently no scenario in which this would actually be an issue.
In the future, we will need to convert execution-encoding strings into UTF-8, for example in the context of reflection in C++ (should that materialize), or as part of https://isocpp.org/files/papers/P2741R3.pdf (an accepted C++26 feature).
Note D105759, “Implement P2361 Unevaluated string literals”, which intends to clarify some of these encoding questions (sadly, not all of them).
Overall, I agree with the general architecture; however, we really do need to resolve the iconv-related questions before progressing this.
Would defaulting to ICU and falling back to iconv on platforms where ICU is not available (mostly z/OS) be something the community would be fine with?