Clang's string type?

I was looking over the code for Clang’s format specifier checking (primarily the FormatStringHandler class), and the issue I noticed is that everything is built on standard C strings.

Does Clang/LLVM even have a string class that supports Unicode (at least UTF-16, and hopefully UTF-32)?

In order to support wprintf, we’d have to support UTF-16 in the API, which would amount to a massive patch set for pretty much everything.

And templates wouldn’t even work, because they are “lowered” at compile time, not runtime, and the last thing anyone wants is two separate versions of Clang: one that supports UTF-8 (and ANSI code pages), and another that supports UTF-16/wchar.

Not specifically. You can use ArrayRef/SmallVector to store “strings” if necessary. And include/llvm/Support/ConvertUTF.h has conversions. I don’t follow; can’t you just convert the format string from UTF-16/UTF-32 to UTF-8 before checking it? (Granted, that’s not particularly efficient, but it’s rare enough that it probably doesn’t matter.) -Eli
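(As a rough, untested sketch of that suggestion: the helper name and error handling below are mine, not an existing Clang API, and the exact overloads available in llvm/Support/ConvertUTF.h can vary between LLVM versions.)

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/Support/ConvertUTF.h"
  #include <string>

  // Hypothetical helper: convert a UTF-16 format string to UTF-8 so the
  // existing char-based format-string checker can run over it unchanged.
  static std::string toUTF8ForChecking(llvm::ArrayRef<llvm::UTF16> Units) {
    std::string UTF8;
    if (!llvm::convertUTF16ToUTF8String(Units, UTF8))
      return std::string(); // ill-formed UTF-16, e.g. a lone surrogate
    return UTF8;
  }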

Hi Marcus,

In order to support wprintf, we'd have to support UTF-16 in the API, which would amount to a massive patch set for pretty much everything.

Similar issues came up earlier this year when people discussed EBCDIC
(yay!) support:
https://lists.llvm.org/pipermail/cfe-dev/2018-January/056721.html

The impression I get from that is similar to Eli's suggestion: Clang
likes UTF-8 internally.

Cheers.

Tim.

Thanks for the link to that thread, Tim.

Eli: “I don’t follow; can’t you just convert the format string from UTF-16/UTF-32 to UTF-8 before checking it? (Granted, that’s not particularly efficient, but it’s rare enough that it probably doesn’t matter.)”

I realized a bit after posting this that converting the format strings from UTF-16/wchar to UTF-8 would probably be the best way to achieve this, Eli.

I’m just not sure how I’d handle the type matching. Do you know when that happens relative to when the string/character literals would be converted? Would that get in the way, or get messed up?

In the clang AST, a string literal is represented as an array of integers of the appropriate width; the lexer converts from UTF-8 to UTF-16 or UTF-32 at the same time it resolves escapes. (This is necessary to compute the length of the string, which is part of the string literal’s type.)

You can check the width of the characters in a string using StringLiteral::getCharByteWidth(). It’s 1, 2 or 4, depending on whether it’s UTF-8, UTF-16, or UTF-32. You can read individual characters from that array using StringLiteral::getCodeUnit(). Or you can grab the whole array using StringLiteral::getBytes() (note that the return type here is a bit misleading).

Actually, you might not want to use a real UTF-16 to UTF-8 conversion; maybe better to translate all non-ASCII bytes to 0xFF or something. Not that it really affects the parsing, but it probably makes translating back to a source location along the lines of StringLiteral::getLocationOfByte easier.

-Eli
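(To make that concrete, here is a rough, untested sketch; the helper name is made up, but getLength(), getCodeUnit(), and the 0xFF trick are what Eli describes above.)

  #include "clang/AST/Expr.h"
  #include <cstdint>
  #include <string>

  // Hypothetical helper: flatten a UTF-16/UTF-32 string literal into one
  // byte per code unit, mapping every non-ASCII code unit to 0xFF so that
  // byte index == code unit index when mapping back to source locations.
  static std::string flattenForFormatCheck(const clang::StringLiteral *SL) {
    std::string Bytes;
    Bytes.reserve(SL->getLength());
    for (unsigned I = 0, N = SL->getLength(); I != N; ++I) {
      uint32_t Unit = SL->getCodeUnit(I); // one code unit, not a code point
      Bytes.push_back(Unit < 0x80 ? static_cast<char>(Unit) : '\xFF');
    }
    return Bytes;
  }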

You can read individual characters from that array using StringLiteral::getCodeUnit()

By “characters” do you mean a complete Unicode code point? Because a code unit is just a byte in UTF-8, or a 16-bit value in UTF-16.

Let’s say that I have a string in Clang somewhere that contains 🦁 (the Lion emoji).

In UTF-32, i.e. as a single Unicode scalar value, the Lion emoji is: U+1F981.

In UTF-8 the Lion emoji would be encoded as: 0xF0 0x9F 0xA6 0x81.

In UTF-16 (BE) the Lion emoji would be encoded as the surrogate pair: 0xD83E 0xDD81.

In UTF-16 (LE) the same two code units would be stored with their bytes swapped: 0x3E 0xD8 0x81 0xDD.
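(For reference, those UTF-16 code units fall out of the standard surrogate-pair encoding; a quick, illustrative way to check the arithmetic:)

  #include <cstdint>
  #include <cstdio>

  int main() {
    uint32_t CP = 0x1F981;              // Unicode scalar value of the Lion emoji
    uint32_t V  = CP - 0x10000;         // 0x0F981
    uint32_t Hi = 0xD800 + (V >> 10);   // 0xD83E, the high surrogate
    uint32_t Lo = 0xDC00 + (V & 0x3FF); // 0xDD81, the low surrogate
    std::printf("%04X %04X\n", (unsigned)Hi, (unsigned)Lo); // prints: D83E DD81
    return 0;
  }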

So what exactly does getCodeUnit() return?

For example, if I did that on a string that had a Surrogate Pair, would I get the Unicode Scalar Value?

Oh, no, sorry, like the name suggests, it returns a code unit. -Eli
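(In other words, for the lion example a UTF-16 literal holds two code units, and recombining them into a scalar value is up to the caller. A quick, untested sketch, with the function name made up:)

  #include "clang/AST/Expr.h"
  #include <cstdint>

  // Assuming SL points at the UTF-16 literal u"\U0001F981":
  //   SL->getCodeUnit(0) == 0xD83E  (high surrogate)
  //   SL->getCodeUnit(1) == 0xDD81  (low surrogate)
  // Reassembling the Unicode scalar value is the caller's job:
  static uint32_t scalarFromSurrogatePair(const clang::StringLiteral *SL) {
    uint32_t Hi = SL->getCodeUnit(0);
    uint32_t Lo = SL->getCodeUnit(1);
    return 0x10000 + ((Hi - 0xD800) << 10) + (Lo - 0xDC00); // 0x1F981
  }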