"Fixes" for two crashes, rant on Tok.getIdentifierInfo() and two more bugs

Hi,

the crash I reported and fixed earlier ( http://lists.cs.uiuc.edu/pipermail/cfe-dev/2007-December/000745.html ) happened because `Tok.getIdentifierInfo()` sometimes returns 0.

These conditions are not clearly documented in Token.h, and even if they were documented, functions that may or may not return 0 are generally error-prone. So I grepped clang for calls to `getIdentifierInfo()` and found two places where the return value was not handled correctly. Tests to reproduce the crashes and makeshift patches are attached. (Someone familiar with the code needs to look at the FIXMEs in the patch. The problems were related to ObjC's @try/@catch and ObjC2 @interface prefixes.)

(Why is it a good idea to treat stuff like @try as two tokens instead of one?)

Furthermore, I'd suggest at least adding an assert wherever you know that `getIdentifierInfo()` can't return 0 and rely on that. Writing `assert(Tok.getIdentifierInfo() && "foo always has ident info")` also serves as good documentation.
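To make the suggestion concrete, here is a minimal sketch of the pattern. `SketchToken`, `SketchIdentifierInfo`, `SketchKind` and `spellingOfTypeofKeyword` are hypothetical stand-ins made up for illustration, not the real clang classes; only the shape of the assert matters.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for illustration only; not clang's real types.
struct SketchIdentifierInfo {
  std::string Name;
};

enum class SketchKind { Identifier, NumericConstant, KwTypeof };

struct SketchToken {
  SketchKind Kind;
  const SketchIdentifierInfo *II; // null for tokens such as numbers or strings

  bool is(SketchKind K) const { return Kind == K; }
  const SketchIdentifierInfo *getIdentifierInfo() const { return II; }
};

// The caller relies on a non-null pointer, so it says so with an assert
// before dereferencing.
std::string spellingOfTypeofKeyword(const SketchToken &T) {
  assert(T.is(SketchKind::KwTypeof) && "Not a typeof specifier");
  const SketchIdentifierInfo *II = T.getIdentifierInfo();
  assert(II && "keyword tokens always have ident info");
  return II->Name;
}
```

The assert costs nothing in release builds and tells the next reader why the dereference is believed to be safe.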

In the following places it was not immediately clear to me why the code is valid and `getIdentifierInfo` can't possibly return 0 (line numbers relative to rev 45360):

Lex/MacroExpander.cpp:
line 324

Lex/Preprocessor.cpp:
2222
2253
2329

Parse/ParseDecl.cpp:
101
1467

Parse/ParseExpr.cpp:
216
247
(785)

Parse/Parser.cpp:
377 (one of the bugs, fixed with the patch)

Parse/ParseObjc.cpp:
304
325 (but only because of strange indentation caused by tabs instead of spaces -- fixed in the attached patch as well)
(476)
1130 (one of the bugs, fixed with the patch)
1164 (one of the bugs, fixed with the patch)
1235

Even better than adding asserts in these lines would be to catch this problem with the compiler (for example, by putting `getIdentifierInfo()` in a subclass and never letting it return 0; then you _have_ to check for the right token type before you can call the method), but that's a bit of work :P

An unrelated crash that I found on the way is:

     int main()
     {
       id a;
       [a bla:0 6:7];
     }

(crashes somewhere in sema, something like this should be put in test/Parse/objc-messaging-1.m)

And here's an inconsistency with gcc:

     int @interface bla ; // ?? this is valid objc?
     @end

I have no idea what this code is supposed to do, but clang accepts it without a warning, while gcc doesn't even compile it.

Nico

ps: I also converted a few tabs to spaces

crashes.patch (7.08 KB)

the crash I reported and fixed earlier ( http://lists.cs.uiuc.edu/pipermail/cfe-dev/2007-December/000745.html ) happened because Tok.getIdentifierInfo() sometimes returns 0.

Right. Tokens that are not “pp-identifiers” in the lexer do not have an identifier pointer. This includes tokens like numbers (1), strings (“foo”), etc.

These conditions are not clearly documented in Token.h, and even if they were documented, functions that may or may not return 0 are generally error-prone. So I grepped clang for calls to getIdentifierInfo(). I found two places where this function was not handled correctly. Tests to reproduce the crashes and makeshift patches are attached (Someone familiar with the code needs to look at the FIXMEs in the patch. Problems were related to ObjC’s @try/@catch and ObjC2 @interface prefixes).

Nice! Your patch looks exactly right, I applied it here (after tweaking the expected-error stuff):
http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20071224/003579.html

(Why is it a good idea to treat stuff like @try as two tokens instead of one?)

The answer is that things like @ /*comment*/ try are legal, sadly enough. However, it seems that we could probably do something in the lexer (when it sees the “@”) to handle this. I’ll see what I can do about this when I have time.

Furthermore, I’d suggest to at least use an assert if you know that getIdentifierInfo() can’t return 0 and rely on it. Doing an assert(Tok.getIdentifierInfo() && "foo always has ident info") serves as good documentation.

Well, in theory the code should only call and dereference getIdentifierInfo() if it already knows the pointer is non-null. If that isn’t clear from the context of the call in the code, adding an assert makes sense.

In the following places it was not immediately clear to me why the code is valid and getIdentifierInfo can’t possibly return 0 (line numbers relative to rev 45360):

Lex/MacroExpander.cpp:
line 324

This is safe because previous code verified that the macro arguments are identifiers.

#define A(1)

should be rejected earlier. Adding an assert would make sense.

Lex/Preprocessor.cpp:
2222
2253
2329

The calls to ReadMacroName verify that the name is an identifier.

Parse/ParseDecl.cpp:
101

I’m not sure about this. That call is only reachable if “Tok.is(tok::identifier) || isDeclarationSpecifier()”. It is unclear to me that all declspecs have identifiers. Steve?

1467

     assert(Tok.is(tok::kw_typeof) && "Not a typeof specifier");
     const IdentifierInfo *BuiltinII = Tok.getIdentifierInfo();

The assertion verifies that the token is a keyword, and keywords always have an identifier ptr. This code is trying to preserve typeof vs __typeof__ in a diagnostic.

Parse/ParseExpr.cpp:
216

ParseExpressionWithLeadingIdentifier is only called with an identifier as IdTok.

247

likewise for ParseAssignmentExprWithLeadingIdentifier.

(785)

This is only called with these 4 keywords as the current token:
case tok::kw___builtin_va_arg:
case tok::kw___builtin_offsetof:
case tok::kw___builtin_choose_expr:
case tok::kw___builtin_types_compatible_p:

Parse/Parser.cpp:
377 (one of the bugs, fixed with the patch)

Ok.

Parse/ParseObjc.cpp:
304
325 (but only because of strange indentation because of tabs instead of spaces – fixed in the attached patch as well)
(476)
1130 (one of the bugs, fixed with the patch)
1164 (one of the bugs, fixed with the patch)
1235

Even better than adding asserts in these lines is to catch this problem with the compiler (for example, by putting getIdentifierInfo() in a subclass and never letting it return 0. Then you have to check for the right token type to call the method), but that’s a bit of work :P

This would also require the Token class to be polymorphic, which is a non-starter. Another potential solution would be to make getIdentifierInfo() always assert that the pointer is non-null. Callers that don’t know whether the pointer is valid would then have to call Tok.hasIdentifierInfo() first, or we would add a getIdentifierInfoOrNull() method for them.
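As a rough sketch of what that alternative could look like (the class below is a hypothetical `CheckedToken` written for illustration, not the actual Token class, and `describe` is an invented caller):

```cpp
#include <cassert>
#include <string>

// Hypothetical types for illustration; names and layout are assumptions.
struct CheckedIdentifierInfo {
  std::string Name;
};

class CheckedToken {
  const CheckedIdentifierInfo *II = nullptr;

public:
  CheckedToken() = default;
  explicit CheckedToken(const CheckedIdentifierInfo *Info) : II(Info) {}

  // For callers that do not know whether identifier info is present.
  bool hasIdentifierInfo() const { return II != nullptr; }

  // Asserts instead of silently handing back null.
  const CheckedIdentifierInfo *getIdentifierInfo() const {
    assert(II && "token has no identifier info");
    return II;
  }

  // Explicit opt-in for callers that want to handle null themselves.
  const CheckedIdentifierInfo *getIdentifierInfoOrNull() const { return II; }
};

// A caller that cannot assume anything about the token checks first.
std::string describe(const CheckedToken &T) {
  if (!T.hasIdentifierInfo())
    return "<no identifier>";
  return T.getIdentifierInfo()->Name;
}
```

No virtual functions are needed, so the token stays a plain value type; the cost is that the check happens at run time in assert-enabled builds rather than at compile time.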

An unrelated crash that I found on the way is:

     int main()
     {
       id a;
       [a bla:0 6:7];
     }

(crashes somewhere in sema, something like this should be put in test/Parse/objc-messaging-1.m)

And here’s an inconsistency with gcc:

     int @interface bla ; // ?? this is valid objc?
     @end

I have no idea what this code is supposed to do, but it doesn’t warn with clang but doesn’t even compile with gcc.

I’ll let Steve and Fariborz chime in on these.

ps: I also converted a few tabs to spaces

Thanks! It would make it easier to review the patch if you kept the mechanical pieces separate from the changes that require review, but I appreciate the patch.

As an aside, things will probably pick up in early January; many people are out for the holidays.

-Chris

An unrelated crash that I found on the way is:

   int main()
   {
     id a;
     [a bla:0 6:7];
   }

(crashes somewhere in sema, something like this should be put in test/Parse/objc-messaging-1.m)

This should produce an error, not a crash :). I will take a look.

And here's an inconsistency with gcc:

   int @interface bla ; // ?? this is valid objc?
   @end

I have no idea what this code is supposed to do, but it doesn't warn with clang but doesn't even compile with gcc.

This is invalid and should produce an error. However,

@interface bla ;
@end

is OK (the ';' is optional)!

- Fariborz

Okay, so it's quite possible to hack the lexer to merge these into one token. The problem with this is that it would lose the location info of the two pieces. I'd prefer to keep the lexer pure, and have it return perfect location info. This allows one client (eventually sema in this case) to throw away loc info it doesn't need, while still letting other clients use it (e.g. -E mode).

This means we're stuck with the parser having to handle these as two tokens, sorry.
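For illustration, handling the pair in the parser could look roughly like the sketch below. `AtToken`, `AtKind` and `isAtKeyword` are hypothetical names invented for this example; the real parser works on clang's Token class. The point is that each piece keeps its own location, and that the token after '@' has to be checked for being an identifier before its spelling is examined.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token model for illustration; not clang's Token.
enum class AtKind { At, Identifier, Number, Other };

struct AtToken {
  AtKind Kind;
  std::string Text;
  unsigned Loc; // every token keeps its own source location
};

// Returns true if Toks[I] is '@' and the following token is an identifier
// spelled Keyword. Input such as "@ 6" is rejected without touching any
// identifier data of the second token.
bool isAtKeyword(const std::vector<AtToken> &Toks, std::size_t I,
                 const std::string &Keyword) {
  assert(I < Toks.size() && "index out of range");
  if (Toks[I].Kind != AtKind::At || I + 1 >= Toks.size())
    return false;
  const AtToken &Next = Toks[I + 1];
  if (Next.Kind != AtKind::Identifier)
    return false; // no identifier info to look at
  return Next.Text == Keyword;
}
```

Because the two tokens stay separate, a diagnostic can still point at either the '@' or the keyword.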

-Chris

Chris Lattner wrote:-

>> (Why is it a good idea to treat stuff like @try as two tokens
>> instead of one?)
>
> The answer is that things like @ /*comment*/ try are legal, sadly
> enough. However, it seems that we could probably do something in
> the lexer (when it sees the "@") to handle this. I'll see what I
> can do about this when I have time.

Okay, so it's quite possible to hack the lexer to merge these into one
token. The problem with this is that it would lose the location info
of the two pieces. I'd prefer to keep the lexer pure, and have it
return perfect location info. This allows the client (eventually sema
in this case) to throw away loc info it doesn't need, but allows other
clients to use it (e.g. -E mode).

This means we're stuck with the parser having to handle these as two
tokens, sorry.

Moreover I distinctly remember the justification for two tokens was
to enable macro expansion of the second part; this was a specific
requirement Apple (Stan) had.

Neil.

This means we're stuck with the parser having to handle these as two
tokens, sorry.

Moreover I distinctly remember the justification for two tokens was
to enable macro expansion of the second part; this was a specific
requirement Apple (Stan) had.

Right, I was planning on lexing them as two tokens (allowing each to be macro expanded) and applying 'magic' to return them as one token from the lexer. The problem being that there is only space for one location info.

-Chris