[RFC] Project Widen Your Char-izons (Adding widechar support to LLVM-libc)

This was written by @uzairnawaz and @sribee8, who are my interns at Google for the summer.

Objective

LLVM-libc should support C standard wide-character functions by implementing the conversion between wide and multibyte characters.

Background

Wide characters are a standard feature of C standard library implementations that provide an API for working with characters larger than a byte (ex: emojis). This API includes basic string utilities (getting the length of a string, searching, concatenating, etc) as well as conversion functions between different character encodings (our project focuses mainly on UTF-8 and UTF-32). Wide character functionality is important for several Google-internal customers, and is not currently available in LLVM-libc.

The two main ways to represent the extended set of characters are through either variable-length character encodings or wide characters. UTF-8 is an example of a variable-length encoding which represents characters using anywhere from 1-4 bytes. The specific values of certain bytes encode data about the number of bytes for a given character. These ranges are non-overlapping, making it easy to detect if you are starting from the middle of a series of bytes representing a character. Wide characters are another approach which involve representing each character as a 4 byte value, making random access possible at the cost of compactness.

The specific technical details of these encodings can be found through the following links:

Design

Overview

We decided to implement wide characters as 32 bit integers using the UTF-32 encoding since this is pretty standard.

Adding wide character support to LLVM-libc involves implementing a collection of functions that have a standardized interface and expected behavior. The documentation for these functions can be found here: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf

The collection of wide character functions can be classified into three main categories: string utilities, conversions, and I/O.

String Utilities

This includes functions like wcslen (get the length of a widechar string) or wcscat (concatenate two widechar strings). These functions are the wchar* version of their string counterparts.

Conversions

Since different systems and programming languages can use different character encodings, conversion functions alongside wide character functions can bridge this gap and allow for information to be correctly processed. For example, multibyte characters can be used to represent a particular character that is not part of the basic character set, however, strlen may have unexpected results as it returns the number of bytes in the string. Standard functions such as printf also expect to operate on char* data, so it is necessary to convert wide characters to multibyte characters. Therefore, the conversion between the two provides an interface that allows for more accurate and predictable behavior with wide characters.

I/O

Useful programs tend to interact with other processes/users in order to complete some task. The standard library provides I/O utilities in order to do so, such as printf and scanf. The implementation of these functions will likely involve a combination of calling into the existing byte-oriented functions and our newly implemented conversion utilities.

Detailed design

libc/src/__support/wchar will contain our internal functions relating to wc→mb and mb→wc.

This will include the following internal functions:

The rest of the functions will rely on our internal functions and mbstowcs and mbsrtowc as well as wcstombs and wcsrtomb will share an internal implementation.

** Lighter colors represent the internal functions

Restartable Functions

Restartable conversion functions allow users to be able to perform partial conversions on a string, allowing them to break it up into smaller steps. However, if a function ends while in the middle of processing a multibyte character (that is, it doesn’t end cleanly on a character boundary), the next function call will need to be able to access that partial state.

The C standard defines a mbstate_t struct that is used to represent the current state of a string conversion. This struct is opaque and implementation-dependent, so we will need to decide what information is relevant to include. For example, consider the following code:


wchar_t wc_string[1];

mbstate_t mbs;

size_t ret = mbrtowc(wc_string, &mb_str, /* # of bytes read from mb_str */ 1, &mbs);

ASSERT(ret == (size_t)(-2)); // we have performed a partial conversion, and the data we need to remember is stored in mbs

// we want to continue the conversion, picking up where we left off

ret = mbrtowc(wc_string, &mb_str + 1, /* # of bytes read from mb_str */ 3, &mbs);

ASSERT(ret == 3); // 3 bytes were needed to complete the conversion

ASSERT(wc_string[0] == 0x0001F921); // utf-32 representation of the emoji

Since mbrtowc is called with a parameter of one, only one byte of the multibyte character will be read. mbs (the mbstate_t variable) will therefore need to store the first byte and store information about how many total bytes of the multibyte character have been read/written. This will allow for proper return values indicating whether the conversion is fully done/valid or incomplete/invalid.

Note: For non-restartable functions we do not need to pass an mbstate_t variable since they do not have the ability to pick up in the middle – if it was not passed with the correct n (number of bytes to read), it would have to start from the beginning on a separate call. The nonrestartable functions can call their restartable counterparts with a null mbstate.

Internal State Class

To simplify implementation for future conversions to other encoding formats (ex: UTF-16), we want to use a class that will hold both the mbstate_t struct and the internal functions to access the data.

What we will store in our mbstate_t struct:

  • 4 bytes representing a partial conversion to UTF-32
  • An 8 bit int to hold the number of bits processed
  • An 8 bit int to hold the total number of bytes in the character

How will it work?

We will create a queue-like internal interface with push and pop functions to simplify conversions. These functions will be overloaded to handle different character encodings.

  • Push will convert and push bits into mbstate_t
    • Since it takes multiple UTF-8 bytes to represent an entire UTF-32 character, we will maintain a “partial” representation of the UTF-32 encoding of the character defined by the UTF-8 bytes processed so far
  • Pop will return the character information in the specified encoding format
struct mbstate_t {
    wchar_t partial; // the partial encoding of the character in utf-32
    uint8_t bits_processed; // # of bits that are represented within partial
    uint8_t total_bytes; // # of bytes needed to represent the full character
}
class CharacterConverter {

    mbstate_t* state;
    bool isComplete();

    // int return type will give error codes
    int push(char utf8_byte);
    int push(wchar_t utf32);

    utf_ret<char> pop_utf8(); // utf_ret contains a type T and an int for error code
    utf_ret<wchar_t> pop_utf32();
}

internal::mbrtowc Pseudocode

size_t mbrtowc(wchar_t* out, const char* mbs, size_t nbytes, mbstate_t* state):
    CharacterConverter charConv(state);
    for each byte in mbs -> mbs + nbytes:
        int success = charConv.push(byte)
        if success == -1:
            return success
        if charConv.isComplete():
            charConv.pop_utf32(out)

wcrtomb is going to be implemented in a similar way, using the mbstate_t variable to track the current state of the conversion.

Potential for UTF-16

We would like to eventually have support for UTF-16 conversions since UTF-16 is used by other systems, both as a multibyte-like format and as a wchar_t type (e.g., on Windows). To create this potential, we will hold our mbstate_t struct within an internal class which will function as an intermediate state when converting from UTF-8 to UTF-32 (and vice versa). Since we are managing the data through overloaded functions, in the future, the class simply has to add overloads for push and pop functions for conversions for UTF-16.

Alternatives considered

mbstate_t struct (bits vs bytes):

For our mbstate_t struct, we initially considered having a stand-alone struct that contained an array with 4 bytes to hold bytes as they were read in. Another option for the stand-alone struct was storing 3 bytes and the number of bytes that were read in, however, this would not have been able to hold a complete state which could have ruined some of the class functionality. After deciding on keeping mbstate_t within a class, it made more sense to have a partially converted character within the struct as well as the number of bits that were converted so far to keep track of an intermediate state therefore making conversions between other encodings such as UTF-16 simpler. Keeping it stored as bits rather than bytes also makes it easier for the conversion to UTF-32 since after completely reading in the multibyte character we can left shift the 32 bit value once rather than left shifting while converting. The bit number paired with the total number of expected bytes would also make it clear when things were fully converted/read in.

Internal Functions:
We were considering not having the additional internal functions for mbsrtowcs and wcsrtombs and instead repeatedly calling the character conversion functions for all the functions that had to do with reading an entire string. However, we decided that it would be easier to have these functions call their respective internal functions that can convert an entire string since we would have most likely ended up copying this code for each function.

FWIW I think in general it makes sense.However, have you considered using the high bits of the UTF-32 character to store the number of bits processed and total number of bytes? The total number of processed bits can only be 21 IIUC, which would fit into five bits and the maximum number of bytes processed can at most be 4 or so, which would easily fit into the remaining 6 bits. This would make the whole struct fit into four bytes, which would make the “store as UTF-32” strictly superior to the “store the unprocessed bytes” version.

This is actually something that we considered! We probably should’ve mentioned this in the tradeoffs section, but we ultimately decided against this to future-proof in the case that the UTF-32 ever requires more bits beyond the 21 currently needed. We also didn’t feel like the downside of an extra two bytes was significant enough to justify this added complexity, but I’d be curious to hear if you disagree/have any additional thoughts.

1 Like

Given that there are currently something like 150k Unicode characters and 21 bits is enough for something like a million characters, I don’t think we’ll run out any time soon. If you think the future-proofing is worth probably doubling the size of the struct, I guess that’s fine too. At that point you might want to go all the way and add an extra two bytes to the character representation.

Does the size of this struct matter in practice? I wouldn’t have expected an application to ever have more than one per thread at most, so saving space by using extra bits wouldn’t seem worth the time cost of doing the extra bitwise arithmetic, regardless of the future-proofing. Is there a use-case I’m missing?

bitwise arithmetic is so fast on modern CPUs that compressing your data is almost always worth the extra instructions. In this case it could be a non-negligible win on 32 bit platforms, since we can use a single register to pass the struct instead of the stack.

All the libc functions that take mbstate_t take it as a pointer, so it shouldn’t matter if it fits in a register.

That doesn’t really matter, does it? You can just have a function in the header that forwards to a libarary-internal function to avoid the memory access.