A development plan for enhanced character handling

Follow-up to my stream of earlier emails!

My initial pilot project to get familiar with clang and its development
process has grown somewhat, so I thought I had better outline a project
plan before I get too far.

So, first question:
  Is there a recognised process for initiating/tracking such projects, or
  is the community small and informal enough that this mailing list is
  sufficient?

The plan for my project is to support:
  New C++0x Unicode character types and literals
  C99 Unicode TR character types
  C++0x raw string literals (C++ only, or would ObjectiveC/C++ be
  interested?)
  Support for UCNs in identifiers
  Support source files encoded as UTF-8/UTF-16/UTF-32

The last of those points is probably the most controversial, and I need
clear guidance on how to do it efficiently.

For source files, the ultimate plan would be to read the file into memory,
then check for a BOM.

If no BOM is present, assume the file is UTF-8 and proceed as today.
  Permit, but do not require, a UTF-8 BOM.
If a UTF-16/32 BOM is present, transcode the file into UTF-8.
  Pass this UTF-8 encoded buffer to the preprocessor, as today.
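
As a concrete sketch of the BOM check (names here are illustrative, not
existing clang APIs; note that UTF-32LE must be tested before UTF-16LE,
since the UTF-16LE BOM is a prefix of the UTF-32LE one):

  #include <cstring>

  enum SrcEncoding { SE_UTF8, SE_UTF16BE, SE_UTF16LE, SE_UTF32BE, SE_UTF32LE };

  // illustrative helper: classify a buffer by its leading BOM bytes
  SrcEncoding detectEncoding(const unsigned char *buf, std::size_t len) {
    if (len >= 4 && !std::memcmp(buf, "\x00\x00\xFE\xFF", 4)) return SE_UTF32BE;
    if (len >= 4 && !std::memcmp(buf, "\xFF\xFE\x00\x00", 4)) return SE_UTF32LE;
    if (len >= 3 && !std::memcmp(buf, "\xEF\xBB\xBF", 3))     return SE_UTF8;
    if (len >= 2 && !std::memcmp(buf, "\xFE\xFF", 2))         return SE_UTF16BE;
    if (len >= 2 && !std::memcmp(buf, "\xFF\xFE", 2))         return SE_UTF16LE;
    return SE_UTF8; // no BOM: assume UTF-8 and proceed as today
  }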

What is not clear is whether I should keep the original source file in
memory to help when reporting diagnostics, or whether we simply keep the
UTF-8 buffer, knowing we can transcode text back to the original encoding
on demand if necessary.

We can assume a one-to-one correspondence of characters between these
encodings, so I don't foresee a problem working purely with UTF-8 as today,
transcoding to the wider formats on demand the few times the original
encoding matters.

However, in order to ensure success, the first task must be to audit the
existing code to be sure that it correctly handles source files with UTF-8
characters outside the basic ASCII set. Any issues here should probably be
my first order of business.

Assuming the code audit passes, my provisional project plan is to implement
in the following stages, with a commit after each stage. The first task is
the simplest, so I will use it to learn the protocols for adequate testing,
documentation, etc. for a check-in review.

Suggested implementation sequence:
  UTF-8 literals u8"tra-la-la"
    do not accidentally invent u8 character literals
    concatenation with a wchar_t L"literal" is a diagnosable error
    do not break regular narrow-literal concatenation with wchar_t
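
    A sketch of the acceptance criteria for this first stage (per my
    reading of the C++0x draft):

      const char *u8s   = u8"tra-la-la";  // new: UTF-8 string literal
      // char bad       = u8'x';          // must not compile: no u8 char literals
      // ... = u8"one" L"two";            // diagnosable error: u8/wide mix
      const wchar_t *ok = "one" L"two";   // unchanged: narrow/wide concatenation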

  implement native Unicode char types
    char16_t/char32_t for C++
    _Char16_t/_Char32_t for the C99 Unicode TR
    define the 'always Unicode' macro specified in the Unicode TR
      __STDC_ISO_10646__ == yyyymmL for the year/month of the latest
      spec supported
      implies wchar_t uses a Unicode encoding
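
    Roughly what this stage should let us write (the macro value below is
    just an example):

      char16_t c16 = 0;   // C++0x spellings (keywords)
      char32_t c32 = 0;
      // the C99 Unicode TR spells these _Char16_t / _Char32_t instead
      #ifdef __STDC_ISO_10646__
      // e.g. 199712L: wchar_t holds code points from that revision of
      // ISO 10646, i.e. wchar_t uses a Unicode encoding
      long iso10646 = __STDC_ISO_10646__;
      #endif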

  implement Unicode character literals
    single characters only
    char16_t literals must come from the basic multilingual plane
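
    For example (code points chosen arbitrarily):

      char16_t c1 = u'\u00E9';       // OK: U+00E9 is in the BMP
      char32_t c2 = U'\U00010000';   // OK: char32_t holds any code point
      // char16_t e = u'\U00010000'; // error: outside the BMP
      // char32_t f = U'ab';         // error: single characters only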

  implement Unicode string literals
    involves expanding the 'AnyWide' bool to support
    char/char16_t/char32_t/wchar_t, and maybe u8
    do not drag in raw literals at this point, to avoid a combinatorial
    flag explosion
    must define our own heterogeneous concatenation rules
      recommend making all conditionally supported conversions
      diagnosable errors for the initial check-in
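
    One plausible shape for the 'AnyWide' expansion (illustrative only,
    not the actual clang code):

      // replace the single 'is it wide?' flag in the literal parser
      // with a kind, so mixed concatenations can be diagnosed precisely
      enum StringLiteralKind {
        SLK_Narrow,  // "..."
        SLK_UTF8,    // u8"..."
        SLK_UTF16,   // u"..."
        SLK_UTF32,   // U"..."
        SLK_Wide     // L"..."
      };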

  define and implement any support for heterogeneous string
concatenation
    note rules:
      char -> wchar_t required
      no u8 -> wchar_t
      u8 -> char32_t conditionally supported
      char -> char32_t required
      char32_t -> wide conditionally supported
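
    The same rules rendered as code, with the earlier recommendation
    applied (conditionally supported cases diagnosed for now):

      const wchar_t  *a = "x" L"y";  // required: narrow -> wchar_t
      const char32_t *b = "x" U"y";  // required: narrow -> char32_t
      // u8"x" L"y"  -- ill-formed, never supported
      // u8"x" U"y"  -- conditionally supported: error for now
      // U"x"  L"y"  -- conditionally supported: error for now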

  implement raw string literals
    [Core issue] must any non-basic-source character be treated as
    6 (or 10) d-chars?
      recommend we issue a diagnostic warning, but accept the code
      Beman is concerned this encourages non-portable code
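
    To illustrate the concern (delimiter syntax here is illustrative): in
    translation phase 1 an extended character is rewritten as a UCN, so a
    single delimiter character can legally count as 6 (or 10) d-chars:

      const char *s = R"é(no \n escape processing in here)é";
      // phase 1 may rewrite the delimiter é as \u00e9 -- six d-chars;
      // per the recommendation above: accept, but warn (non-portable)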

  Finally implement non-UTF-8 file support
    [[pre-requisite - be sure the Unicode transcoding utilities are all
    implemented and validated]]
    Read/map file into memory
    Check for a BOM
    Permit but do not require a UTF-8 BOM
    If the BOM is missing
      assume UTF-8 encoding
    else
      flag the source file as that encoding
      If the encoding is not supported
        issue a diagnostic
      If the encoding is not UTF-8
        transcode the file and pass on the transcoded buffer
        Q: do we retain the original buffer for diagnostics?
        Q: do we transcode diagnostic messages back to the source
        encoding on the fly?
          we have the UTF-8 buffer, so we can round-trip just that
          source
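
    The same steps as a code sketch (isSupported, diagnose, and
    transcodeToUTF8 are assumed utilities, not existing clang functions;
    detectEncoding is the earlier sketch):

      #include <cstddef>
      #include <string>

      std::string loadSource(const unsigned char *buf, std::size_t len) {
        SrcEncoding enc = detectEncoding(buf, len);    // check for BOM
        if (enc == SE_UTF8)                            // no BOM, or UTF-8 BOM
          return std::string((const char *)buf, len);  // proceed as today
        if (!isSupported(enc))
          diagnose("unsupported source file encoding");
        // the preprocessor still sees a UTF-8 buffer, as today
        return transcodeToUTF8(buf, len, enc);
      }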
  
  Done!
    (apart from processing bug reports)

As for a timetable: this is a fair chunk of work, so I don't plan on
anything beyond u8 literal support this side of the next C++ standards
meeting in Frankfurt, as that will claim the lion's share of my attention
until then.

AlisdairM

> Follow-up to my stream of earlier emails!
>
> My initial pilot project to get familiar with clang and its development
> process has grown somewhat, so I thought I had better outline a project
> plan before I get too far.
>
> So, first question:
>   Is there a recognised process for initiating/tracking such projects, or
>   is the community small and informal enough that this mailing list is
>   sufficient?

This mailing list is sufficient for coordination; there aren't so many
people that we've ever needed a formal process.

> The plan for my project is to support:
>   New C++0x Unicode character types and literals
>   C99 Unicode TR character types
>   C++0x raw string literals (C++ only, or would ObjectiveC/C++ be
>   interested?)
>   Support for UCNs in identifiers
>   Support source files encoded as UTF-8/UTF-16/UTF-32

All of those look good; they're mostly independent, though.

Note that Objective-C is generally considered an extension to some base
language: ObjectiveC on C99 or C++98 wouldn't get raw string literals,
but ObjectiveC on C++0x would.

> The last of those points is probably the most controversial, and I need
> clear guidance on how to do it efficiently.
>
> For source files, the ultimate plan would be to read the file into
> memory, then check for a BOM.
>
> If no BOM is present, assume the file is UTF-8 and proceed as today.
>   Permit, but do not require, a UTF-8 BOM.
> If a UTF-16/32 BOM is present, transcode the file into UTF-8.
>   Pass this UTF-8 encoded buffer to the preprocessor, as today.
>
> What is not clear is whether I should keep the original source file in
> memory to help when reporting diagnostics, or whether we simply keep the
> UTF-8 buffer, knowing we can transcode text back to the original
> encoding on demand if necessary.

We have to transcode on demand for generality: suppose the file is in
UTF-16 and the terminal is ShiftJIS. I can't think of any situation
where we would need access to the file in its original encoding.
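
For instance, a UTF-8 diagnostic snippet could be re-encoded for the
terminal on demand with POSIX iconv (a sketch; error handling omitted):

  #include <iconv.h>

  // transcode a UTF-8 snippet to the terminal encoding, e.g. Shift-JIS
  size_t toTerminal(const char *utf8, size_t inLen, char *out, size_t outLen) {
    iconv_t cd = iconv_open("SHIFT_JIS", "UTF-8");  // to-, from-encoding
    char *src = const_cast<char *>(utf8);
    char *dst = out;
    iconv(cd, &src, &inLen, &dst, &outLen);
    iconv_close(cd);
    return dst - out;  // number of bytes written
  }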

> However, in order to ensure success, the first task must be to audit the
> existing code to be sure that it correctly handles source files with
> UTF-8 characters outside the basic ASCII set. Any issues here should
> probably be my first order of business.

We don't calculate column numbers correctly in various cases involving
non-ASCII characters; I can't think of any other issues.

Somewhat nasty testcase for the column numbers:
void 風; void 風; void 風; void 風; void 風; void 風;

-Eli