Finally found time to hack a little code, starting with better detection of unsupported source-file encodings.
My plan is to detect a BOM in ContentCache::getBuffer() and, initially, flag
an "unsupported encoding" error whenever one is found.
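As a rough sketch of that first step (detectBOM is an invented helper name, not the actual ContentCache change), the check only needs to sniff the leading bytes of the buffer:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: sniff the first bytes of a buffer and name any
// Unicode BOM found. A non-null result would trigger the "unsupported
// encoding" diagnostic for now.
inline const char *detectBOM(const char *Buf, std::size_t Len) {
  // Order matters: check the four-byte UTF-32 signatures before the
  // two-byte UTF-16 ones, since "\xFF\xFE" is a prefix of the UTF-32LE BOM.
  if (Len >= 4 && std::memcmp(Buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32 (BE)";
  if (Len >= 4 && std::memcmp(Buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32 (LE)";
  if (Len >= 3 && std::memcmp(Buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
  if (Len >= 2 && std::memcmp(Buf, "\xFE\xFF", 2) == 0)         return "UTF-16 (BE)";
  if (Len >= 2 && std::memcmp(Buf, "\xFF\xFE", 2) == 0)         return "UTF-16 (LE)";
  return nullptr; // no BOM: assume plain ASCII/UTF-8
}
```

Checking the longer signatures first is the one subtle point; everything else is a byte comparison.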
I think this is probably too low-level a place to put the check; the clients of getBuffer() should do it.
A follow-up patch would then transcode from the detected encoding into a
UTF-8 buffer before returning from the function, and flag an appropriate
error if the file then turns out not to be correctly encoded after all.
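The "turns out not to be correctly encoded after all" part could start as a plain well-formedness walk over the (claimed) UTF-8 buffer. This is a simplified sketch under an invented name, not Clang code: it rejects stray continuation bytes, truncated sequences, bad lead bytes, and two-byte overlongs, but skips the finer points (surrogates, the U+10FFFF cap):

```cpp
#include <cstddef>

// Hypothetical validity check run after BOM stripping / transcoding.
inline bool isValidUTF8(const unsigned char *S, std::size_t N) {
  for (std::size_t I = 0; I < N;) {
    unsigned char C = S[I];
    std::size_t Extra;
    if (C < 0x80) { I += 1; continue; }          // ASCII
    else if ((C & 0xE0) == 0xC0) {
      if (C < 0xC2) return false;                // 0xC0/0xC1: overlong 2-byte form
      Extra = 1;
    }
    else if ((C & 0xF0) == 0xE0) Extra = 2;
    else if ((C & 0xF8) == 0xF0) Extra = 3;
    else return false;                           // stray continuation / bad lead byte
    if (I + Extra >= N) return false;            // truncated sequence at end of buffer
    for (std::size_t J = 1; J <= Extra; ++J)
      if ((S[I + J] & 0xC0) != 0x80) return false; // each trailing byte must be 10xxxxxx
    I += Extra + 1;
  }
  return true;
}
```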
Can you sketch out your implementation approach? Will you *just* transcode to UTF-8, or would you also convert UCNs? In my (surely naive) view, we'd want to transcode the file encoding but leave UCNs as written in the file, so that diagnostics print high characters if present and UCNs if present without conflating them. OTOH, the identifier logic would canonicalize both to an internal form when looking up the identifier for a token.
Is this workable?
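One way to read the "canonicalize both to an internal form" idea: the buffer keeps UCNs exactly as written (so diagnostics echo the original spelling), but identifier lookup expands each \uXXXX to its UTF-8 bytes, so that \u00FC and a literal u-umlaut collide on the same key. expandUCNs and encodeUTF8 are hypothetical names for illustration, and this sketch only handles the 4-digit \u form:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of code point CP to Out.
inline void encodeUTF8(std::uint32_t CP, std::string &Out) {
  if (CP < 0x80) {
    Out += char(CP);
  } else if (CP < 0x800) {
    Out += char(0xC0 | (CP >> 6));
    Out += char(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += char(0xE0 | (CP >> 12));
    Out += char(0x80 | ((CP >> 6) & 0x3F));
    Out += char(0x80 | (CP & 0x3F));
  } else {
    Out += char(0xF0 | (CP >> 18));
    Out += char(0x80 | ((CP >> 12) & 0x3F));
    Out += char(0x80 | ((CP >> 6) & 0x3F));
    Out += char(0x80 | (CP & 0x3F));
  }
}

// Build the internal lookup key for an identifier spelling: \uXXXX escapes
// become UTF-8 bytes, everything else passes through unchanged.
inline std::string expandUCNs(const std::string &Spelling) {
  std::string Key;
  for (std::size_t I = 0; I < Spelling.size();) {
    if (Spelling[I] == '\\' && I + 5 < Spelling.size() && Spelling[I + 1] == 'u') {
      std::uint32_t CP = std::stoul(Spelling.substr(I + 2, 4), nullptr, 16);
      encodeUTF8(CP, Key);
      I += 6;
    } else {
      Key += Spelling[I++];
    }
  }
  return Key;
}
```

With this, the UCN spelling and the raw-character spelling of the same identifier produce identical lookup keys while the source text itself stays untouched.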
I can see how the DiagnosticBuilder API works in other code, so I'm happy
with the idea of re-using an existing "not supported" flag or creating a new
one. My problem is that I can't see how to hook up a DiagnosticBuilder in the
first place: there does not seem to be one available from within a member
function of ContentCache, and I'm not sure how else to obtain one.
Is the idea that SourceManager would always return buffers in UTF-8 form? If the check has to be at this low a level, I think it would be best to make SourceManager return an error code, and then have the preprocessor (and other clients) emit the diagnostic as appropriate. Emitting the diagnostic from SourceManager directly would be problematic because it doesn't know *why* the file was being read (e.g. in response to a #include), so it wouldn't be able to produce good diagnostic info.
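That division of labour might look like this in miniature. Every name here is invented, not Clang's API: the low-level accessor reports *what* failed via an error code, and the caller, which knows *why* the file was read, words the diagnostic:

```cpp
#include <string>

// What went wrong, with no opinion on how to report it.
enum class BufferError { Success, UnsupportedEncoding };

struct BufferResult {
  std::string Data;  // UTF-8 contents on success
  BufferError Err;
};

// Stand-in for the low-level SourceManager path: detect-and-refuse only.
inline BufferResult getBufferChecked(const std::string &Raw) {
  if (Raw.size() >= 2 &&
      (unsigned char)Raw[0] == 0xFF && (unsigned char)Raw[1] == 0xFE)
    return {"", BufferError::UnsupportedEncoding}; // UTF-16LE BOM
  return {Raw, BufferError::Success};
}

// Stand-in for the preprocessor handling a #include: it owns the context,
// so it can say *which* include failed and why. Returns "" on success.
inline std::string handleInclude(const std::string &Name,
                                 const std::string &Raw) {
  BufferResult R = getBufferChecked(Raw);
  if (R.Err == BufferError::UnsupportedEncoding)
    return "error: '" + Name + "': unsupported source file encoding in #include";
  return "";
}
```

The low level stays diagnostics-free, and each client can phrase the error in terms of what it was doing when the read failed.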
BTW, I have largely turned my UTF-x support plans on their head, and will
start with diagnosing bad encodings,
Excellent, building from the bottom up sounds great.
then support extended characters in
identifiers (which also means correcting column numbers in diagnostics) and
UCNs, before adding support for C++0x Unicode types and literals, and then
(finally!) raw string literals.
If I have a file with:
I would expect "x" to have a column number of 9, not of 4. However, if the high character was written as a single high character, I would expect "x" to have a column number of 4. Do you agree?
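A toy model of that column rule, assuming a line such as ab\u00FCx (two characters, then the UCN, then x, which matches the 9-vs-4 arithmetic; the concrete line is my invention, not from the thread). Columns count characters as written in the file, so a six-character UCN spelling advances the column by six, while a raw multibyte UTF-8 character advances it by one:

```cpp
#include <cstddef>
#include <string>

// Column (1-based) of the byte at Offset, counting source characters as
// written: UTF-8 continuation bytes don't start a new column, but every
// character of a spelled-out UCN does.
inline unsigned columnOf(const std::string &Line, std::size_t Offset) {
  unsigned Col = 1;
  for (std::size_t I = 0; I < Offset; ++I) {
    unsigned char C = Line[I];
    if ((C & 0xC0) == 0x80) continue; // continuation byte: same character
    ++Col;
  }
  return Col;
}
```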
Thanks for tackling this Alisdair!