Fixits with multibyte chars

Hi, everyone. We recently hit an assertion when trying to output a fixit with Unicode characters in it; it reduces down to this:

#include <stdio.h>

void test() {
  printf("∆: %d", 1L);
}

I could of course just disable fixits when there are Unicode characters involved, but I'd like to fix this the right way. The trouble is -fdiagnostics-parseable-fixits, which is supposed to be machine-readable output: in this case, does a three-byte UTF-8 character count as three columns or as one? I think one column is the right way to go, but I wanted to get some other opinions before I start working on a patch.
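To make the two readings concrete, here is a small standalone sketch (not Clang code; the helper and its names are purely illustrative) that computes both candidate columns for the %d in the example above:

#include <cstdio>
#include <string>

// Count columns as Unicode code points by skipping UTF-8 continuation
// bytes (0b10xxxxxx); the bytes-as-columns answer is just the byte offset.
static unsigned codePointColumn(const std::string &Line, size_t ByteOffset) {
  unsigned Col = 0;
  for (size_t I = 0; I < ByteOffset && I < Line.size(); ++I)
    if ((Line[I] & 0xC0) != 0x80) // not a continuation byte
      ++Col;
  return Col + 1; // columns are 1-based
}

int main() {
  // The string from the test case; "∆" (U+2206) is three bytes in UTF-8.
  std::string Line = "  printf(\"∆: %d\", 1L);";
  size_t ByteOffset = Line.find("%d");
  std::printf("byte column:       %zu\n", ByteOffset + 1);
  std::printf("code point column: %u\n", codePointColumn(Line, ByteOffset));
  return 0;
}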

I'd be getting around to this soon anyway; it's blocking PR13178 (fixit for smart quotes).

Thanks,
Jordan

Hi, everyone. We recently hit an assertion when trying to output a fixit with Unicode characters in it; it reduces down to this:

#include <stdio.h>

void test() {
  printf("∆: %d", 1L);
}

I could of course just disable fixits when there are Unicode characters involved, but I'd like to fix this the right way. The trouble is -fdiagnostics-parseable-fixits, which is supposed to be machine-readable output: in this case, does a three-byte UTF-8 character count as three columns or as one? I think one column is the right way to go, but I wanted to get some other opinions before I start working on a patch.

This actually depends on the system. On some systems we'll print the unicode codepoint in hex, others will get the 1 column char. There is the llvm::sys::locale::columnWidth function to get this information in a portable way.
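A rough sketch of what that looks like (assuming the llvm/Support/Locale.h header and a columnWidth(StringRef) signature; the exact location and signature may differ between LLVM revisions):

#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Locale.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  // "∆" is three UTF-8 bytes but, on most terminals, one display column.
  llvm::StringRef Text = "∆: %d";
  int Width = llvm::sys::locale::columnWidth(Text);
  llvm::outs() << "bytes: " << Text.size()
               << ", display columns: " << Width << "\n";
  return 0;
}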

- Ben

Okay, that gives us two problems, then… for user-visible fixits we can use llvm::sys::locale::columnWidth (thanks, Ben), but then -fdiagnostics-parseable-fixits will have different column numbers than the user-visible output. Is that okay?

(Currently -fdiagnostics-parseable-fixits counts columns in bytes rather than characters.)

Machine-parsable "columns" are not the same as columns in the
terminal. I'm pretty sure the model we want is one "column" per
Unicode code point, regardless of how it is displayed.

-Eli

How should we behave if the file contains a byte sequence which is not valid UTF-8 (for instance, if arbitrary raw data is placed inside a raw string literal)? For the machine-parsable form, I’d feel more comfortable with bytes from the start of the line for this reason.

Source code which isn't valid UTF-8 is illegal, even in raw string
literals. That said, we allow it in some cases anyway, so we need to
recover consistently.

We could define a consistent scheme for data which isn't UTF-8, but
you're right, it might be easier to just use "bytes since the last
newline".

-Eli

One problem with column numbers is that different consumers find different definitions useful. The two main use cases for column numbers are for visual display and for indexing into an array containing some representation of the source in order to do something with that location.

The first case is intimately tied to the specific renderer displaying the source. Depending on the rendering, column numbers could take into account combining characters, wide characters, characters whose width varies (tabs) or differs between fonts, etc. The second case needs different column numbers depending on how the source is encoded; a 'column number' could index a UTF-8 code unit, a UTF-16 code unit, a codepoint/UTF-32 code unit, or a code unit for some other encoding.

Most of these complexities aren't currently taken into account. Almost everywhere, a 'column number' is an index into the source buffer (one-based rather than zero-based), which means the value is in UTF-8 code units. I did some work on the console diagnostics display for Unicode where I try to convert from source-index column numbers to console display column numbers using wcswidth. These console display column numbers aren't used anywhere except to get the range highlighting and fixits to display with the correct alignment. That change was mostly in r154980 / git commit 6749dd50869281.

We need to provide column numbers that are reasonably useful to any consumer. Text renderers will have their own method of calculating column numbers, so we just need to provide indexes into the source that anyone can use. Providing column numbers in terms of code units for a particular encoding means a client that uses that encoding can index directly into its source line buffer, but it also forces clients using other encodings to go through contortions to use that column number. Alternatively, column numbers in terms of code points should be relatively straightforward for any client using any Unicode encoding to convert into a column number they can use.
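To make that contortion concrete (a hedged sketch with made-up names, assuming the line is valid UTF-8): a client whose buffers are UTF-16 has to re-walk the line to use a UTF-8 code-unit column.

#include <cstdio>
#include <string>

// Convert a 1-based UTF-8 code-unit column into a 1-based UTF-16
// code-unit column for the same line.
static unsigned utf8ColToUtf16Col(const std::string &Line, size_t Utf8Col) {
  unsigned Utf16Units = 0;
  for (size_t I = 0; I + 1 < Utf8Col && I < Line.size();) {
    unsigned char B = Line[I];
    size_t Len = (B < 0x80) ? 1 : (B < 0xE0) ? 2 : (B < 0xF0) ? 3 : 4;
    Utf16Units += (Len == 4) ? 2 : 1; // astral code points take a surrogate pair
    I += Len;
  }
  return Utf16Units + 1;
}

int main() {
  std::string Line = "  printf(\"∆: %d\", 1L);";
  size_t Utf8Col = Line.find("%d") + 1;
  std::printf("UTF-8 column %zu -> UTF-16 column %u\n",
              Utf8Col, utf8ColToUtf16Col(Line, Utf8Col));
  return 0;
}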

Another option would be to provide column numbers in terms of code units, but give options as to the encoding. So clients that use a particular representation could request column numbers directly relevant to them.

- Seth

Hello,

[...]
I could of course just disable fixits when there are Unicode characters involved, but I'd like to fix this the right way. The trouble is -fdiagnostics-parseable-fixits, which is supposed to be machine-readable output: in this case, does a three-byte UTF-8 character count as three columns or as one? I think one column is the right way to go, but I wanted to get some other opinions before I start working on a patch.

I would prefer one column, too. If, however, this cannot be implemented reliably in all cases, I think a line which is a bit too short would be better than one which is too long: at least it doesn't get wrapped unintentionally.

Jonathan

How should we behave if the file contains a byte sequence which is not valid
UTF-8 (for instance, if arbitrary raw data is placed inside a raw string
literal)? For the machine-parsable form, I’d feel more comfortable with
bytes from the start of the line for this reason.

Source code which isn’t valid UTF-8 is illegal, even in raw string
literals. That said, we allow it in some cases anyway, so we need to
recover consistently.

We could define a consistent scheme for data which isn’t UTF-8, but
you’re right, it might be easier to just use “bytes since the last
newline”.

-Eli

Three things about that:

(1) The C standard explicitly permits multibyte characters (of arbitrary encoding) in 5.2.1.2. C++ [lex.phases]p1.1 implies a similar idea (non-basic characters are conceptually mapped into the source character set using UCNs). So saying non-UTF-8 code is “illegal” is sort of arbitrary; I’m not sure if we’ve ever documented this restriction, but we certainly don’t enforce it.

(2) This is for fixits, which can appear within invalid code. I’m working on Unicode recovery in a private branch.

(3) Chris has said we assume UTF-8 for now.

We want to assume that the input charset is in UTF8 for now. If we ever add support for other code pages, we’ll either do it by rewriting the entire buffer all at once ahead of time (this is really the only option if the input is in UTF16), or by doing something else crazy like pervasively making the lexer know about single-byte codepages. Since we only support UTF8 for now, I’d just start there.

( http://llvm.org/bugs/show_bug.cgi?id=13178#c7 )

In order to fix the crasher, I’ll change the human-readable output to use printed columns, using sys::locale::columnWidth as suggested by Ben. I’ll leave the parseable diagnostics as-is for now, but add a note to our manual that machine-parseable ranges use “bytes from the beginning of the line” as the index. We can revisit that at any time.
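For reference, the parseable output for the reduced test case would look something like the line below (the filename and the exact line and column values are illustrative, not taken from an actual run). Under the byte convention, the three-byte "∆" occupies byte columns 11-13 of the printf line, which is why the %d starts at byte column 16 rather than 14:

fix-it:"t.c":{2:16-2:18}:"%ld"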

Jordan

One more point of interest: the normal column numbers (-fshow-column) also use byte offsets, not character counts. I've added the current behavior to the user manual (and fixed the crash for the fixit line) in r160319.

Jordan

Three things about that:

(1) The C standard explicitly permits multibyte characters (of arbitrary
encoding) in 5.2.1.2. C++ [lex.phases]p1.1 implies a similar idea (non-basic
characters are conceptually mapped into the source character set using
UCNs). So saying non-UTF-8 code is "illegal" is sort of arbitrary; I'm not
sure if we've ever documented this restriction, but we certainly don't
enforce it.

"Illegal", in the sense that clang makes the implementation-defined
decision to unconditionally interpret source code as UTF-8. And as
far I know, the only place where clang will silently accept bytes
which don't form valid UTF-8 codepoints is in comments.

(2) This is for fixits, which can appear within invalid code. I'm working on
Unicode recovery in a private branch.

Right… I wasn't arguing that we can just ignore them.

-Eli