Handling Unicode: code points vs. code units

Hopefully the last question before I start posting some patches on this!

I think the big problem we face in correctly handling extended characters in UTF-8 (and, to a lesser extent, UCNs in identifiers) is that much, if not all, of the current code assumes that code points and code units are the same.

In Unicode terms, each character is identified by a code point: a 21-bit value representing a single character from the Unicode character set. In UTF-32 each code point maps directly to one code unit, but in UTF-8 and UTF-16 a single code point may require multiple code units.

The effect shows up any time a character outside the basic 7-bit ASCII range appears in a string literal or a comment: the column numbers for any diagnostic on that line will be wrong beyond that character.

Fundamentally, if we want to get this 'right' we should stop talking about characters and deal exclusively with character sequences. Once we are dealing with UTF-8 we can no longer assume a 'character' will fit into a single 'char' variable. I am not yet sure how pervasive such a change would be, as AFAICT most functions are already passing around pointers (in)to text buffers. The difference may be more in how we approach the code than in the source itself.

However, one place it really does matter is in reporting column numbers in diagnostics. We need to report column numbers in terms of characters, or code positions, rather than code units as we do today.

In order to clarify when we are dealing explicitly with code positions, I propose introducing a new class to describe such offsets rather than using a simple integer variable. Thanks to inlining, this class should perform equivalently to a plain integer in all but a couple of use cases. The majority of the codebase should be unaffected, continuing to work in terms of byte offsets into a buffer. However, whenever we need to render a source-file column number we should go via this new type. The opaque type should catch many code point vs. code unit mix-ups at compile time rather than runtime, although I don't have an exhaustive list of APIs that should be updated yet, so we must also stay alert for APIs using the wrong types.

A quick sketch of the class would look a little (or a lot!) like:

#include <cstddef>

struct CharacterPos {
   // We must declare a default constructor as there is another
   // user declared constructor in this class.
   // Choose to always initialize the position member. This means
   // that CharacterPos is not a POD class. In C++0x we might
   // consider using = default, which might leave position
   // uninitialized, although 0x is more fine-grained in its
   // usage of PODs and trivial operations, so explicit initialization
   // is probably still the best choice.
   CharacterPos() : position() {}

   // The remaining special members left implicit in order to
   // preserve triviality. In C++0x would explicitly default them.
   // CharacterPos( CharacterPos const & rhs ) = default;
   // ~CharacterPos() = default;
   // CharacterPos & operator=( CharacterPos const & rhs ) = default;

   // Constructor iterates string from start to offset
   // counting UTF-8 characters, i.e., code points.
   // Throws an exception if str is not a valid UTF-8 encoding.
   CharacterPos( char const * str, std::size_t offset );

   // Iterates str, returning a pointer to the initial code unit
   // of the UTF-8 character at 'position'.
   // Throws an exception if str is not a valid UTF-8 encoding.
   char const * Offset( char const * str ) const;

   CharacterPos & operator+=( CharacterPos rhs ) {
      position += rhs.position;
      return *this;
   }

   CharacterPos & operator-=( CharacterPos rhs ) {
      position -= rhs.position;
      return *this;
   }

   std::ptrdiff_t operator-( CharacterPos rhs ) const {
      return position - rhs.position;
   }

   bool operator==( CharacterPos rhs ) const { return position == rhs.position; }
   bool operator<( CharacterPos rhs ) const { return position < rhs.position; }
   bool operator<=( CharacterPos rhs ) const { return position <= rhs.position; }
   bool operator!=( CharacterPos rhs ) const { return !(*this == rhs); }
   bool operator>( CharacterPos rhs ) const { return rhs < *this; }
   bool operator>=( CharacterPos rhs ) const { return rhs <= *this; }

private:
   std::size_t position;
};

CharacterPos operator+( CharacterPos lhs, CharacterPos rhs ) {
   return lhs += rhs;
}

char const * operator+( char const * lhs, CharacterPos rhs ) {
   return rhs.Offset( lhs );
}

Note that two operations in here have linear complexity rather than constant:
   CharacterPos( char const * str, std::size_t offset );
   char const * Offset( char const * str ) const;

These are also the important APIs that define why the class exists.
In all other ways it should be a reasonable arithmetic type.
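To make the linear cost concrete, here is a minimal sketch of how both operations might iterate the buffer, assuming the input has already been validated as UTF-8 (the free-function names are illustrative, not proposed API):

```cpp
#include <cstddef>

// True for UTF-8 continuation bytes, i.e. bytes of the form 10xxxxxx.
// Only the lead byte of each character starts a new code point.
inline bool isContinuationByte(unsigned char c) {
   return (c & 0xC0) == 0x80;
}

// What the constructor does: count code points in the first
// 'offset' code units of 'str'. Assumes 'str' is valid UTF-8.
std::size_t codePointsBefore(char const *str, std::size_t offset) {
   std::size_t count = 0;
   for (std::size_t i = 0; i != offset; ++i)
      if (!isContinuationByte(static_cast<unsigned char>(str[i])))
         ++count;
   return count;
}

// What Offset() does: return a pointer to the lead byte of the
// code point at character position 'pos' (0-based).
char const *offsetOfCodePoint(char const *str, std::size_t pos) {
   char const *p = str;
   while (pos != 0) {
      ++p;                                 // step over the lead byte
      while (isContinuationByte(static_cast<unsigned char>(*p)))
         ++p;                              // and its continuation bytes
      --pos;
   }
   return p;
}
```

Both walks visit each byte at most once, hence the linear rather than constant complexity noted above.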

I am opting for pass-by-value rather than pass-by-reference-to-const as that is typically more efficient for small data types, although obviously I have no performance measurements to back that up yet.

Also note that these same two APIs have to deal with badly encoded UTF-8 streams, and indicate failure by throwing an exception. I have informally picked up that LLVM/Clang prefer to avoid exceptions as error reporting mechanisms. If this is likely to be an issue I would appreciate guidance on an alternate error reporting mechanism for those same APIs - especially the failed constructor.
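If exceptions turn out to be unacceptable, one conventional alternative would be a named factory function that signals failure through its return value instead of a throwing constructor. A sketch, shown here on a standalone stand-in for the class above (the Create name and the deliberately partial validity check are purely illustrative):

```cpp
#include <cstddef>

// Stand-in for the CharacterPos sketch above, reworked to avoid
// exceptions: construction from a buffer goes through a factory.
class CharacterPos {
public:
   CharacterPos() : position(0) {}

   // Returns false (leaving 'result' untouched) if 'str' is not
   // acceptable UTF-8 up to 'offset', instead of throwing. The check
   // here only rejects bytes that can never appear in UTF-8; a real
   // version would validate the full encoding.
   static bool Create(char const *str, std::size_t offset,
                      CharacterPos &result);

private:
   std::size_t position;
};

bool CharacterPos::Create(char const *str, std::size_t offset,
                          CharacterPos &result) {
   std::size_t count = 0;
   for (std::size_t i = 0; i != offset; ++i) {
      unsigned char c = static_cast<unsigned char>(str[i]);
      if (c >= 0xF8)            // 11111xxx never appears in UTF-8
         return false;          // report failure without exceptions
      if ((c & 0xC0) != 0x80)   // count lead bytes only
         ++count;
   }
   result.position = count;
   return true;
}
```

The cost is that callers must default-construct and then check a bool, but it keeps the type usable in an exception-free build.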

AlisdairM

However, one place it really does matter is in reporting column numbers in diagnostics. We need to report column numbers in terms of characters, or code positions, rather than code units as we do today.

That's not quite right; people don't consider column numbers in terms
of either code positions or code units; humans consider columns in
terms of the actual rendering. So, for example, combining characters
shouldn't usually be counted towards the column number.

In order to clarify when we are dealing explicitly with code positions, I propose introducing a new class to describe such offsets rather than using a simple integer variable. Thanks to inlining, this class should perform equivalently to a plain integer in all but a couple of use cases. The majority of the codebase should be unaffected, continuing to work in terms of byte offsets into a buffer. However, whenever we need to render a source-file column number we should go via this new type. The opaque type should catch many code point vs. code unit mix-ups at compile time rather than runtime, although I don't have an exhaustive list of APIs that should be updated yet, so we must also stay alert for APIs using the wrong types.

This seems like overkill; there aren't that many places in the code that
care about column numbers. It should be sufficient to audit the
callers of getColumnNumber() and friends in SourceManager.
Essentially, there are three interesting notions of a column number:
the column number in bytes (what that API returns now), the logical
column number used in diagnostics, and the column number in terms of
cells in a monospace font. I'm not sure if the latter two quantities
should be separate, though. If we have APIs for the latter two
quantities, I can't think of anything that would need to do
computations on them, or would want the column number in bytes.

-Eli

(I don't know how the code is structured, I'm answering from a
conceptual point of view)

You could still use an integer for keeping track of the column number
if it's possible to change the "increment" operation. IIRC Unicode has
enough information to decide if the next code point will allocate a
new graphical space or not (and combining code points are stored after
the allocating code point). The new class does exactly the same, but
the "increment" approach would avoid creating a new class.
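For what it's worth, that increment idea might look something like the sketch below. The combining-character test here covers only U+0300..U+036F (the Combining Diacritical Marks block) as an illustrative subset; a real implementation would consult the full Unicode tables:

```cpp
#include <cstddef>

// Decode one UTF-8 code point starting at p and advance p past it.
// Minimal sketch: assumes the input is already valid UTF-8.
unsigned decodeUTF8(char const *&p) {
   unsigned char c = static_cast<unsigned char>(*p++);
   if (c < 0x80)
      return c;                            // single-byte (ASCII) case
   int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
   unsigned cp = c & (0x3F >> extra);      // payload bits of lead byte
   while (extra--)
      cp = (cp << 6) | (static_cast<unsigned char>(*p++) & 0x3F);
   return cp;
}

// Illustrative subset only: U+0300..U+036F, combining diacritical
// marks. The real predicate needs full Unicode property data.
bool isCombining(unsigned cp) {
   return cp >= 0x0300 && cp <= 0x036F;
}

// The "increment" approach: bump the column only when a code point
// occupies a new graphical space.
std::size_t displayColumns(char const *s, char const *end) {
   std::size_t cols = 0;
   while (s != end)
      if (!isCombining(decodeUTF8(s)))
         ++cols;
   return cols;
}
```

So "e" followed by U+0301 (combining acute) counts as one column, not two.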

PS: This wouldn't handle text that changes direction in the middle of
the line, but I think editors don't deal with that either (gnome
editor, for example, starts counting in reverse if I write some text
from right to left). So, both column counts would be equally wrong. :-)

I've found a default algorithm for finding character boundaries
(called grapheme clusters in Unicode). The most general version of the
algorithm works like this: a grapheme cluster is either "CRLF" or is composed
of zero or more "Prepend" characters, followed by either a
"Hangul syllable" or any other code point that is not a control,
followed by a sequence of "Grapheme_Extend" or "Spacing_Mark" code
points. So, every "character" in the editor should have this format.
Because of "Prepend" characters, it's not possible to use the
increment approach directly. In the worst case, a state machine would
do the job.

UAX #29: Unicode Text Segmentation

Last post (today), I promise: (I didn't fully understand that but) it
seems that, given any two adjacent code points in the source file,
it's possible to determine if there's a grapheme break between them:

Grapheme Break Chart
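To illustrate the pairwise idea, here is a toy break test modelling just two of the UAX #29 rules (GB3, keep CR LF together, and an approximation of GB9, don't break before an Extend character, using one combining-mark range as a stand-in for the Extend property):

```cpp
// Toy pairwise test in the spirit of UAX #29: is there a grapheme
// cluster break between adjacent code points a and b? Only two rules
// are modelled; the full algorithm needs the complete property tables
// and several more rules (Prepend, Hangul, Spacing_Mark, ...).
bool isGraphemeBreak(unsigned a, unsigned b) {
   if (a == 0x0D && b == 0x0A)         // GB3: no break within CR LF
      return false;
   if (b >= 0x0300 && b <= 0x036F)     // GB9 (approx): X x Extend
      return false;
   return true;                        // GB999: otherwise, break
}
```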

Thanks - two extremely useful links.

However, I don't think I'm going to go into this much detail just yet. Specifically, we are writing a compiler, not a text rendering engine; getting Unicode correct to this level verges on the typography realm, and it can almost certainly be left as a display issue.

So my plan is to return a 'column number' in the diagnostic messages that refers to the Xth code point on that line, rather than the Xth code unit as today. This should be a big improvement, but as you note will still be off by some columns when viewing our messages in a console window or dumped to a UTF-8 text file. However, we *do* have correct information for IDEs to parse and get the right information to display accordingly. Most importantly, this number will be correct if we are round-tripping with other Unicode encodings, such as reading UTF-16.

The main reason I want to duck this is that it would involve some rather large lookup tables covering potentially several hundred thousand Unicode code points. I'm not prepared to validate the data in such a table, and I'm not sure about implementing a lookup mechanism efficient enough to avoid impacting performance; this is work much more appropriate for the bit-twiddlers than for me. (Efficiency is something we all like to proclaim, but some folks are actually really good at it too! I'm not one of them. :-( )

And that brings me to the second issue I am going to duck.
I had originally planned on validating identifiers against ISO 10176, which specifies which code points should be usable in valid identifiers (vs. other forms of punctuation etc.). Again, this lookup would be costly in a performance-critical part of the code, so I will leave that task to someone with the right skill set.

So I'm narrowing my problem down to reporting badly encoded UTF-8, and reporting the number of code points, rather than code units, in external messages. Internally all code will continue to work in code units, where the pointer arithmetic just works!
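For the badly-encoded-UTF-8 part, a validity check along these lines might do, sketched under the usual well-formedness rules (no stray or missing continuation bytes, no overlong encodings, no surrogates, nothing above U+10FFFF):

```cpp
#include <cstddef>

// Sketch of a UTF-8 validity check suitable for reporting badly
// encoded input. Returns false on the first malformed sequence.
bool isValidUTF8(char const *s, std::size_t len) {
   for (std::size_t i = 0; i < len; ) {
      unsigned char c = static_cast<unsigned char>(s[i]);
      std::size_t n;            // continuation bytes expected
      unsigned cp, min;         // decoded value and its legal minimum
      if (c < 0x80)      { ++i; continue; }        // ASCII fast path
      else if (c < 0xC0) return false;             // stray continuation
      else if (c < 0xE0) { n = 1; cp = c & 0x1F; min = 0x80; }
      else if (c < 0xF0) { n = 2; cp = c & 0x0F; min = 0x800; }
      else if (c < 0xF8) { n = 3; cp = c & 0x07; min = 0x10000; }
      else               return false;             // 0xF8..0xFF illegal
      if (i + n >= len) return false;              // truncated sequence
      for (std::size_t j = 1; j <= n; ++j) {
         unsigned char cc = static_cast<unsigned char>(s[i + j]);
         if ((cc & 0xC0) != 0x80) return false;    // bad continuation
         cp = (cp << 6) | (cc & 0x3F);
      }
      if (cp < min) return false;                  // overlong encoding
      if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
      if (cp > 0x10FFFF) return false;             // beyond Unicode
      i += n + 1;
   }
   return true;
}
```

Running this once per buffer up front would let the later code-point counting assume valid input.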

I shall try to leave // FIXME tags in the appropriate places for someone to follow and provide the more subtle Unicode semantics.

AlisdairM