[RFC][libc] wctype header implementation

Co-author: @bassiounix

Motivation & Background

The wctype header is not currently implemented in LLVM libc. This is a blocker for building libc++ against LLVM libc.

What needs to be implemented:

  • Wide character classification functions (iswalpha, iswdigit, etc.)
  • Wide character conversion functions (towlower, towupper, etc.)
  • Types and macros (wint_t, wctype_t, etc.)

The main crux of the implementation is coming up with an efficient way to represent
the required classification and conversion functions, with full C.UTF8 support (1,114,112 codepoints) without bloating the binary size.

Goals

A standard compliant C.UTF8 and C locale implementation. Support for other locales is not currently planned.

Classification functions

For each code point, we can define a bitmask where each bit would represent a Unicode category, like alpha, digit, upper, etc. There are 12 categories required for the wctype implementation, so this would fit in 2 bytes.
We could store bitmasks in some kind of lookup table. However, not all properties need a lookup table, as some properties are composite ones (alnum, graph), and some properties can be hardcoded for C.UTF8 (for example, in C.UTF8 only ASCII digits are considered digits), so in practice we can get away with using a single byte flag.

Proposed solution: Two stage lookup table

We can take advantage of the clustered nature of Unicode assignments, where contiguous ranges of code points frequently share the same classification properties and don’t store a mask for each property, but for blocks of properties.

Index array: The 21 bit code point is split. The upper 13 bits would index into the index table, where each entry would represent a block of 256 consecutive characters. The value stored here is an index into the block table.

Block array: This array contains the actual property masks. Crucially, the identical blocks are deduplicated. If multiple 256-character ranges in Unicode have the exact same properties, they’d all point to the same entry in the block array.

  • Properties to be stored in the lookup tables
  • upper, lower, alpha, space, print, blank, control, punct
  • Properties that don’t need a lookup table
  • alnum: composite, can be defined as alpha && digit
  • graph: composite, can be defined as print && !space
  • digit: in C.UTF8, only ASCII digits are considered digits, can be hardcoded
  • xdigit: only few possibilities, can be hardcoded

These arrays could be generated at build time and could be included in the binary itself.
According to our prototype implementation, this would increase the code size by ~46KB.
This approach would provide constant time lookup for the classification functions.

We can also implement fast paths for ASCII cases for all properties, skipping the need for a lookup table in those cases.

Benchmarking this against glibc is showing this implementation to be an order of magnitude faster.

Alternatives

We could use a similar approach to glibc, and store the lookup tables on disk which would then be loaded on locale change, and get memory mapped into memory to be used in the actual wctype function calls. This would lead to a smaller binary size and could support locale switching at runtime.

However, it would have a lot of drawbacks. It’d introduce a runtime dependency on the locale files, which we’d have to provide separately from the library itself, and there’s also the overhead of the initial loading from disk.

We propose that we include the lookup tables in the library itself, and that this increase of the binary size by ~46KB is acceptable, it’s fine in server environments, and for size constrained environments like embedded, we could provide a flag for turning this feature off.

Conversion functions

There are two main functions we need to implement here: towlower and towupper. There are multiple options here, depending on if we want to prioritize speed vs size.

Proposed solution: hash map

The most straightforward approach is to store the conversions in a hash map, where the key is the input character and the value is its upper/lower case mapping. We have around 2800 entries for upper and lower case mappings, leading to around 24.5KB increase in binary size, given that we store those mappings directly in memory with an efficient implementation of a hash map. (probably a constexpr, flat hash map with probing)

This approach is shown to be the fastest of all 3 approaches we tried, while being 2nd best in binary size bloat. A possible optimization for the size would be to store the conversions on disk
and load them at runtime, with some form of compression to reduce disk I/O. However, this would increase the moving parts and dependencies of the library (like what to do if the file is not found and where to store/manage it). So we propose to ship the data directly with the library binary for convenience.

Alternative Solution #1: sorted array and binary search lookup

We could store the conversions in a sorted array, and do binary search on lookup.
We’d need two arrays, an array mapping from lower to upper and vice versa. There are roughly 2800 codepoints that have a conversion, and each mapping would be 8 bytes (wint_t → wint_t), so this would end up being ~23KB.

Lookup would be logarithmic, but it might be acceptable as the arrays would be of size ~1400 and we can also implement fast paths for ASCII cases for all properties, skipping the need for doing binary search.

While this approach is shown to be the most optimal for size, it seems to be ~3 times slower in non ASCII cases compared to glibc.

Alternative Solution #2: two stage lookup table

We could use a similar two-level table approach as the classification functions.
The mappings would be stored in 256-entry blocks indexed by the upper 8 bits of the codepoint. Since case mappings are sparse, most blocks are empty and can share a single zero-filled block. This would provide constant-time lookup, but the prototype implementation shows this to be around ~50KB in size.

This is shown to be the same in speed as the hash map implementation, while being twice the size.

Build integration

We propose that the generation scripts, which can be written in any language (we’re planning on using C++ though), should not be included in LLVM libc’s build system. The generation scripts will be checked into the repository, but they should be run manually on Unicode version upgrades, and not on every build, and only the generated files themselves will be included in the build.

Benchmarks

Benchmark results can be seen here.
Might not be totally accurate and representative, only quick prototyping was done.

Prototype code

https://github.com/mleleszi/wctype_prototype

Conclusion

We propose a two-level lookup table for wide character classification and a hash map for case conversion functions. These structures should be stored within the library, resulting in an estimated size increase of around 70KB. To address resource constrained environments, this feature should be optional and configurable at build time.

1 Like

It is possible to disable support for wide characters and locales in libc++. Is this still a blocker with those disabled?

I think we can build libc++ with localization/wide chars disabled, but that effectively disables libc++ streams.

Maybe @michaelrj-google or @vonosmas can comment on this

Yes, currently that disables streams. We’re (currently not super actively) working on removing that requirement though. If you only care about the streams and not actually localization that may be a more straight-forward path.

1 Like
  • Today we can build libc++ against llvm-libc with localization enabled and wide-chars disabled (llvm-libc provides some stubs for things for isalpha_l ). That’s fine, but having wide-strings is important for basic things like Abseil and definitely for Clang.
  • We can almost build libc++ against llvm-libc with wide-chars enabled and localization disabled. The only missing function (on Linux) is swprintf, adding it is not trivial, but possible, there’s some ongoing work underway. But, as you mention, this would disable string streams, which is even more important. @philnik - would you recommend actively pursuing re-work of libc++ to support basic streams functionality even in absence of localization?
  • If we need both localization and wide-chars, then we need <wctype> methods.

If you are interested in that functionality, then I’d definitely recommend it. I don’t think it should be that complicated, since we’ve already factored out most of the localization functions into the locale base API. You’d basically just have to add a new entry to the list of platforms and possibly refactor the `mbstate_t` stuff to be part of the locale base API.

@michaelrj-google @vonosmas
What are the priorities here? Are we primarily trying to unblock building libc++ against llvm-libc, or do we want a UTF8 wctype implementation regardless?
How useful would a localization disabled libc++ be for downstream projects like clang or abseil? If it’s sufficient for those use cases, I’d happily to work on the libc++ changes @philnik mentioned before diving into the wctype implementation.
We probably want a UTF8 wctype implementation eventually anyway, just not sure about the priorities here.

Thanks for the RFC and the discussion. We definitely want a complete unicode wctype implementation at some point to unblock building a wider range of targets with LLVM-libc, but the first thing we need to unblock is libc++. I think most programs don’t directly use wide characters so improving the status of libc++ without localization would be a good step, and we need it to avoid the use of locales anyways.

Given that there are multiple people working on this effort it might be best to do both in parallel, one way would be to have @mleleszi work on the libc++ support while @bassiounix works on the wctype implementation in LLVM-libc. Does that sound good to you?

Sounds good to me.

Based on yesterday’s libc++ meeting, @ldionne will be working on enabling streams to work without localization. Regarding the wctype implementation in libc, we’ll move forward with our plan and start implementing if there are no further comments or concerns.

“enabling streams to work localization” this will be helpful for me, and maybe a few others from what I saw on libcxx’s discord channel

1 Like

@mleleszi @bassiounix I’m curious if you’re familiar with the V8 implementation (see v8/src/strings/unicode.cc at 6d61600863e8e963bd3caf54c56ec6a5f1de06a6 · v8/v8 · GitHub)? This is one of the most efficient and compact implementation I’m familiar with and I’m wondering how does it compare to your proposed approach. Would it be possible to include it in your benchmark?

I checked V8’s implementation, looks like it uses a chunked binary search approach. I’ve added some benchmarks, and unless I’ve messed something up, it appears to be significantly slower (which I think is to be expected for binary search). I couldn’t directly compare the size as they don’t implement all the properties we need to, but I assume their implementation is more compact. I think that’s acceptable though, as we’ll provide a flag to use an ASCII only implementation instead of the full Unicode one for size constrained environments.