Problem in locations

If I try to compile the attached program I get:

$ ~/llvm/Debug/bin/clang-cc -pedantic -std=c89 z.c
z.c:2:9: warning: variable declaration in for loop is a C99-specific feature
  for ( \
        ^
1 diagnostic generated.

The token start is reported not at the position of the "i" of "int" but
on the previous line, and the token length is set to 5.

Is it intentional or is it a bug?

IMHO, having a leading \newline as part of the token confuses the
diagnostic with no benefit.

z.c (64 Bytes)

That is perhaps not the best quality of implementation for the diagnostic, but it is intended. You're hitting issues that are due to the phases of translation in C. The first two phases replace trigraphs and remove escaped newlines (which, as a GNU extension, can be followed by horizontal whitespace... urg). Because the lexer fully integrates the various phases of translation, the source location of a token points to the first byte of the file that is part of that token. In this case, that byte is the backslash of the escaped newline.

If you ask the preprocessor to get the 'spelling' for the token, you'll get 'int'. If you want the location of the nth actual character ("i", "n", "t") of the token, you'll need to use something like StringLiteralParser::getOffsetOfStringByte (but without the 'escape' processing) to advance over escaped newlines and trigraphs.
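As a rough illustration of "advancing over escaped newlines and trigraphs", here is a minimal sketch in plain C. It is not Clang code: `escaped_newline_len` and `skip_leading_escaped_newlines` are hypothetical names, and the sketch only recognizes a backslash (or the `??/` trigraph) followed by optional horizontal whitespace and a newline.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Clang. Returns the length in bytes of the
   escaped newline starting at p, or 0 if there is none. An escaped newline
   is a backslash (or its trigraph spelling), optional spaces/tabs (the GNU
   extension), and then a newline. */
static size_t escaped_newline_len(const char *p)
{
    size_t i;

    if (p[0] == '\\')
        i = 1;
    else if (p[0] == '?' && p[1] == '?' && p[2] == '/')
        i = 3;                      /* trigraph spelling of a backslash */
    else
        return 0;

    while (p[i] == ' ' || p[i] == '\t')
        i++;                        /* horizontal whitespace before the newline */

    if (p[i] == '\r' && p[i + 1] == '\n')
        return i + 2;
    if (p[i] == '\n' || p[i] == '\r')
        return i + 1;
    return 0;                       /* a backslash, but not an escaped newline */
}

/* Returns the offset of the first "real" character of a token whose raw
   bytes start at tok, skipping any number of leading escaped newlines. */
size_t skip_leading_escaped_newlines(const char *tok)
{
    size_t off = 0, n;

    assert(tok != NULL);
    while ((n = escaped_newline_len(tok + off)) != 0)
        off += n;
    return off;
}
```

For the token in the report, whose raw bytes are a backslash, a newline and then `int`, this returns 2, which is the offset of the "i".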

-Chris

Chris Lattner wrote:

If I try to compile the attached program I get:

$ ~/llvm/Debug/bin/clang-cc -pedantic -std=c89 z.c
z.c:2:9: warning: variable declaration in for loop is a C99-specific
feature
for ( \
       ^
1 diagnostic generated.

The token start is not indicated in the position of the "i" of "int" but
in the previous line and the token length is set to 5.

Is it intentional or it is a bug?

IMHO to have a leading \newline as part of the token confuses the
diagnostic without benefits.
int p() {
for ( \
int i = 0; i < 10; ++i)
   ;
return 0;
}

That is perhaps not the best quality of implementation for the
diagnostic, but it is intended. You're hitting issues that are due to
the phases of translation in C. The first phase removes escaped
newlines (which, as a gnu extension, can be followed by horizontal
whitespace... urg) and trigraphs. Because the lexer fully integrates
the various phases of translation, a source location for a token returns
the first byte of the file that is part of that token. In this case, it
is the escaped newline.

Why is the escaped newline not considered the last part of the whitespace?

What do you mean? An escaped newline can occur anywhere in a token; the start of the token isn't special:

foo\
bar??/
baz

is one token.
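
To make the splicing concrete, here is a small sketch in plain C of what the early translation phases do to that text. It is an illustration only, not how Clang's lexer is written: `splice_lines` is a hypothetical name, it handles only a backslash or `??/` immediately followed by `\n`, and it leaves other trigraphs untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst while dropping every escaped
   newline, i.e. a backslash (or its trigraph spelling) directly followed by
   '\n'. dst must have room for strlen(src) + 1 bytes. Returns the length of
   the spliced text. */
size_t splice_lines(const char *src, char *dst)
{
    size_t in = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (src[in] != '\0') {
        size_t n = 0;               /* length of a backslash spelling, if any */

        if (src[in] == '\\')
            n = 1;
        else if (src[in] == '?' && src[in + 1] == '?' && src[in + 2] == '/')
            n = 3;

        if (n != 0 && src[in + n] == '\n') {
            in += n + 1;            /* drop the escaped newline entirely */
            continue;
        }
        dst[out++] = src[in++];     /* ordinary byte: keep it */
    }
    dst[out] = '\0';
    return out;
}
```

Fed the three lines above, the sketch produces the nine characters `foobarbaz`, which is the spelling of the single identifier token.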

-Chris

Chris Lattner wrote:

IMHO to have a leading \newline as part of the token confuses the
diagnostic without benefits.
int p() {
for ( \
int i = 0; i < 10; ++i)
  ;
return 0;
}

That is perhaps not the best quality of implementation for the
diagnostic, but it is intended. You're hitting issues that are due to
the phases of translation in C. The first phase removes escaped
newlines (which, as a gnu extension, can be followed by horizontal
whitespace... urg) and trigraphs. Because the lexer fully integrates
the various phases of translation, a source location for a token returns
the first byte of the file that is part of that token. In this case, it
is the escaped newline.

Why the escaped newline is not considered the last part of the
whitespace?

What do you mean? Escaped newline can occur anywhere in a token, the
start of the token isn't special:

foo\
bar??/
baz

is one token.

Of course, but I believe that a leading ignorable should be ignored, not
included.

Take as an example
foo \
bar\
baz

there are two tokens "foo" and "barbaz", the text associated could be
"foo" and "bar\
baz" or
"foo" and "\
bar\
baz".

I don't see the reason to prefer the latter to the former. I'd think
that to consider " \
" and not only " " as whitespace is a more rational alternative.

Also take into consideration that the standard says that preprocessing
tokens are identified after escaped newline removal.

You're right; the implementation is more efficient this way, though.

-Chris

Rather, you're right that it "could" ignore the leading escape, but I don't see any reason that clients should depend on this.

-Chris

Chris Lattner wrote:

What do you mean? Escaped newline can occur anywhere in a token, the
start of the token isn't special:

foo\
bar??/
baz

is one token.

Of course, but I believe that a leading ignorable should be ignored,
not
included.

You're right, the implementation is more efficient this way though.

Rather you're right that it "could" ignore the leading escape, but I
don't see any reason that clients should depend on this.

The diagnostic is far more readable if we point to the first
non-ignorable character instead of the first ignorable character, and it
seems to me that having an indefinite number of ignorables embedded in a
single-char token is a bit senseless.

\
\
\
\
\
\
\

It also seems strange to me, from a theoretical point of view, that
efficiency is undermined by the more rational behaviour, but certainly
you know better than I do.
  
This seems like quite a corner case, so why not favor efficiency?

If I find a way that does not compromise efficiency, would you be
willing to accept a patch?
  
You can probably modify the diagnostic emission code to check for and skip ignorable characters at the start of a token, instead of modifying token creation. Diagnostics are not on the hot path, so you can afford to spend more cycles there.
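
A minimal sketch of that idea in plain C, assuming the diagnostic code has the raw buffer and the token's start offset at hand. The names `line_col_at` and `diagnostic_offset` are hypothetical, not Clang APIs, and only the backslash form of the escaped newline is handled (no trigraph). The sample buffer in the usage note is a guess at the layout of z.c, chosen to match the reported 2:9 position and token length of 5.

```c
#include <assert.h>
#include <stddef.h>

struct line_col {
    unsigned line;
    unsigned col;
};

/* 1-based line and column of the byte at `offset` in `buf`. */
struct line_col line_col_at(const char *buf, size_t offset)
{
    struct line_col lc = {1, 1};
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < offset; i++) {
        if (buf[i] == '\n') {
            lc.line++;
            lc.col = 1;
        } else {
            lc.col++;
        }
    }
    return lc;
}

/* Hypothetical diagnostic-side fix-up: starting from the token's recorded
   start, skip leading escaped newlines (backslash, optional spaces/tabs,
   newline) and return the offset of the first non-ignorable byte. Token
   creation itself is left untouched. */
size_t diagnostic_offset(const char *buf, size_t tok_start)
{
    size_t off = tok_start;

    assert(buf != NULL);
    for (;;) {
        size_t i = off;

        if (buf[i] != '\\')
            break;
        i++;
        while (buf[i] == ' ' || buf[i] == '\t')
            i++;
        if (buf[i] != '\n')
            break;                  /* lone backslash: not an escaped newline */
        off = i + 1;                /* continue after the newline */
    }
    return off;
}
```

With a buffer laid out as `"int p() {\n  for ( \\\nint i = 0; ..."`, the token start at offset 18 maps to 2:9 via `line_col_at`, while `diagnostic_offset` moves it to offset 20, which maps to 3:1, the "i" of "int". The extra scan only runs when a diagnostic is actually emitted.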

Just my 2 cents.
regards,

Cédric

Rather you're right that it "could" ignore the leading escape, but I
don't see any reason that clients should depend on this.

The diagnostic is far more readable

Yes, it is a QoI issue in the diagnostics subsystem. This doesn't actually happen commonly in practice though, which is why we haven't bothered to do anything about it.

if we point to the first non
ignorable character instead of the first ignorable character and it
seems to me that to have an indefinite number of ignorable embedded

Sure, of course, this is the "beauty" of C.

Note also that the current implementations of getSpellingLineNumber and
getSpellingColumnNumber are confused by this, and they return the
less useful position instead of the more useful one.

Can you be more specific about what you mean?

My basic point is that clients have to be aware of the early translation phases (trigraphs and escaped newlines) anyway. Why is the first character of a token any more special than the other characters?

-Chris