This was written by @uzairnawaz and @sribee8, who are my interns at Google for the summer.
Objective
LLVM-libc should support C standard wide-character functions by implementing the conversion between wide and multibyte characters.
Background
Wide characters are a standard feature of C standard library implementations that provide an API for working with characters larger than a byte (ex: emojis). This API includes basic string utilities (getting the length of a string, searching, concatenating, etc) as well as conversion functions between different character encodings (our project focuses mainly on UTF-8 and UTF-32). Wide character functionality is important for several Google-internal customers, and is not currently available in LLVM-libc.
The two main ways to represent the extended set of characters are through either variable-length character encodings or wide characters. UTF-8 is an example of a variable-length encoding which represents characters using anywhere from 1-4 bytes. The specific values of certain bytes encode data about the number of bytes for a given character. These ranges are non-overlapping, making it easy to detect if you are starting from the middle of a series of bytes representing a character. Wide characters are another approach which involve representing each character as a 4 byte value, making random access possible at the cost of compactness.
The specific technical details of these encodings can be found through the following links:
Design
Overview
We decided to implement wide characters as 32 bit integers using the UTF-32 encoding since this is pretty standard.
Adding wide character support to LLVM-libc involves implementing a collection of functions that have a standardized interface and expected behavior. The documentation for these functions can be found here: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf
The collection of wide character functions can be classified into three main categories: string utilities, conversions, and I/O.
String Utilities
This includes functions like wcslen (get the length of a widechar string) or wcscat (concatenate two widechar strings). These functions are the wchar* version of their string counterparts.
Conversions
Since different systems and programming languages can use different character encodings, conversion functions alongside wide character functions can bridge this gap and allow for information to be correctly processed. For example, multibyte characters can be used to represent a particular character that is not part of the basic character set, however, strlen may have unexpected results as it returns the number of bytes in the string. Standard functions such as printf also expect to operate on char* data, so it is necessary to convert wide characters to multibyte characters. Therefore, the conversion between the two provides an interface that allows for more accurate and predictable behavior with wide characters.
I/O
Useful programs tend to interact with other processes/users in order to complete some task. The standard library provides I/O utilities in order to do so, such as printf and scanf. The implementation of these functions will likely involve a combination of calling into the existing byte-oriented functions and our newly implemented conversion utilities.
Detailed design
libc/src/__support/wchar will contain our internal functions relating to wc→mb and mb→wc.
This will include the following internal functions:
The rest of the functions will rely on our internal functions and mbstowcs and mbsrtowc as well as wcstombs and wcsrtomb will share an internal implementation.
** Lighter colors represent the internal functions
Restartable Functions
Restartable conversion functions allow users to be able to perform partial conversions on a string, allowing them to break it up into smaller steps. However, if a function ends while in the middle of processing a multibyte character (that is, it doesn’t end cleanly on a character boundary), the next function call will need to be able to access that partial state.
The C standard defines a mbstate_t struct that is used to represent the current state of a string conversion. This struct is opaque and implementation-dependent, so we will need to decide what information is relevant to include. For example, consider the following code:
wchar_t wc_string[1];
mbstate_t mbs;
size_t ret = mbrtowc(wc_string, &mb_str, /* # of bytes read from mb_str */ 1, &mbs);
ASSERT(ret == (size_t)(-2)); // we have performed a partial conversion, and the data we need to remember is stored in mbs
// we want to continue the conversion, picking up where we left off
ret = mbrtowc(wc_string, &mb_str + 1, /* # of bytes read from mb_str */ 3, &mbs);
ASSERT(ret == 3); // 3 bytes were needed to complete the conversion
ASSERT(wc_string[0] == 0x0001F921); // utf-32 representation of the emoji
Since mbrtowc is called with a parameter of one, only one byte of the multibyte character will be read. mbs (the mbstate_t variable) will therefore need to store the first byte and store information about how many total bytes of the multibyte character have been read/written. This will allow for proper return values indicating whether the conversion is fully done/valid or incomplete/invalid.
Note: For non-restartable functions we do not need to pass an mbstate_t variable since they do not have the ability to pick up in the middle – if it was not passed with the correct n (number of bytes to read), it would have to start from the beginning on a separate call. The nonrestartable functions can call their restartable counterparts with a null mbstate.
Internal State Class
To simplify implementation for future conversions to other encoding formats (ex: UTF-16), we want to use a class that will hold both the mbstate_t struct and the internal functions to access the data.
What we will store in our mbstate_t struct:
- 4 bytes representing a partial conversion to UTF-32
- An 8 bit int to hold the number of bits processed
- An 8 bit int to hold the total number of bytes in the character
How will it work?
We will create a queue-like internal interface with push and pop functions to simplify conversions. These functions will be overloaded to handle different character encodings.
- Push will convert and push bits into
mbstate_t- Since it takes multiple UTF-8 bytes to represent an entire UTF-32 character, we will maintain a “partial” representation of the UTF-32 encoding of the character defined by the UTF-8 bytes processed so far
- Pop will return the character information in the specified encoding format
struct mbstate_t {
wchar_t partial; // the partial encoding of the character in utf-32
uint8_t bits_processed; // # of bits that are represented within partial
uint8_t total_bytes; // # of bytes needed to represent the full character
}
class CharacterConverter {
mbstate_t* state;
bool isComplete();
// int return type will give error codes
int push(char utf8_byte);
int push(wchar_t utf32);
utf_ret<char> pop_utf8(); // utf_ret contains a type T and an int for error code
utf_ret<wchar_t> pop_utf32();
}
internal::mbrtowc Pseudocode
size_t mbrtowc(wchar_t* out, const char* mbs, size_t nbytes, mbstate_t* state):
CharacterConverter charConv(state);
for each byte in mbs -> mbs + nbytes:
int success = charConv.push(byte)
if success == -1:
return success
if charConv.isComplete():
charConv.pop_utf32(out)
wcrtomb is going to be implemented in a similar way, using the mbstate_t variable to track the current state of the conversion.
Potential for UTF-16
We would like to eventually have support for UTF-16 conversions since UTF-16 is used by other systems, both as a multibyte-like format and as a wchar_t type (e.g., on Windows). To create this potential, we will hold our mbstate_t struct within an internal class which will function as an intermediate state when converting from UTF-8 to UTF-32 (and vice versa). Since we are managing the data through overloaded functions, in the future, the class simply has to add overloads for push and pop functions for conversions for UTF-16.
Alternatives considered
mbstate_t struct (bits vs bytes):
For our mbstate_t struct, we initially considered having a stand-alone struct that contained an array with 4 bytes to hold bytes as they were read in. Another option for the stand-alone struct was storing 3 bytes and the number of bytes that were read in, however, this would not have been able to hold a complete state which could have ruined some of the class functionality. After deciding on keeping mbstate_t within a class, it made more sense to have a partially converted character within the struct as well as the number of bits that were converted so far to keep track of an intermediate state therefore making conversions between other encodings such as UTF-16 simpler. Keeping it stored as bits rather than bytes also makes it easier for the conversion to UTF-32 since after completely reading in the multibyte character we can left shift the 32 bit value once rather than left shifting while converting. The bit number paired with the total number of expected bytes would also make it clear when things were fully converted/read in.
Internal Functions:
We were considering not having the additional internal functions for mbsrtowcs and wcsrtombs and instead repeatedly calling the character conversion functions for all the functions that had to do with reading an entire string. However, we decided that it would be easier to have these functions call their respective internal functions that can convert an entire string since we would have most likely ended up copying this code for each function.




