Confusing comment on LexTokenInternal

I'm looking at how to add additional literals once Unicode types are supported, which naturally takes me to the LexTokenInternal method of the Lexer class. The comment describing the function talks about returning true/false, yet the function is void so cannot return anything. I suspect the API has evolved since the comment was written, but it makes me wonder about the validity of other parts of this comment, such as assumed null-termination of the buffer.

As this code is clearly marked as performance sensitive, I am being quite careful before proposing changes for multiple string/character literal types. The simple drop-through for 'L' will no longer work as we have 10(!) possible literal prefixes in C++0x:

  <empty>
  L
  R
  U
  u
  u8
  LR
  UR
  uR
  u8R

Also, the R variants only apply to string literals, not character literals.

We must preserve enough info to continue parsing as an identifier if we do not find the ' or " character. In the odd case that we find a ' following an R, I believe we parse the chars up to and including the R as an identifier (maybe a macro) and start a fresh narrow-character literal token with the ' and no intervening whitespace.
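
To make the dispatch concrete, here is a minimal sketch of the kind of prefix check involved - purely illustrative, not Clang's lexer, with matchLiteralPrefix an invented name:

    #include <cstring>

    // Illustrative sketch only: try to match a C++0x literal prefix
    // at 'p'. Returns the prefix length when the prefix is followed
    // by a quote it may legally introduce, otherwise 0 so the caller
    // keeps lexing an identifier. Finer per-prefix checks (e.g.
    // which prefixes may pair with ') are elided.
    static unsigned matchLiteralPrefix(const char *p) {
      static const char *const prefixes[] = {
        "u8R", "LR", "UR", "uR", "R",   // raw variants
        "u8", "L", "U", "u"             // non-raw variants
      };
      for (const char *pre : prefixes) {
        unsigned len = (unsigned)std::strlen(pre);
        if (std::strncmp(p, pre, len) != 0)
          continue;
        char quote = p[len];
        bool isRaw = pre[len - 1] == 'R';
        // Raw variants apply only to string literals, so R'x' is
        // rejected here and lexes as the identifier R followed by a
        // fresh narrow character literal.
        if (quote == '"' || (quote == '\'' && !isRaw))
          return len;
      }
      return 0;   // no literal prefix here
    }

A table-driven gate like this keeps the common no-prefix path down to a few comparisons, which matters given how performance sensitive this code is.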

Yuck :frowning:

I also notice that characters beyond the 7-bit ASCII range are deemed 'unknown' rather than potential identifiers. Ideally we should be checking for a valid UTF-8 sequence and allowing it through as an identifier in that case (consuming the full glyph). This seems a prerequisite to adding UCN support for identifiers, as UCNs will (presumably) map to the equivalent UTF-8 sequence. However, I am not sure how far investigation here should look into the GCC ABI for mangling identifiers with extended characters.
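
In sketch form, the check I have in mind is plain UTF-8 well-formedness, nothing Clang-specific (helper name invented):

    #include <cstddef>

    // Sketch: if 'p' begins a structurally valid multi-byte UTF-8
    // sequence, return its length so the lexer can consume the whole
    // glyph as an identifier character; return 0 if malformed. Safe
    // on a nul-terminated buffer: a NUL fails the continuation test.
    static std::size_t validUTF8Length(const unsigned char *p) {
      std::size_t len;
      if (p[0] < 0x80)
        return 0;                   // 7-bit ASCII, handled elsewhere
      else if ((p[0] & 0xE0) == 0xC0)
        len = 2;                    // 110xxxxx + 1 continuation byte
      else if ((p[0] & 0xF0) == 0xE0)
        len = 3;                    // 1110xxxx + 2 continuation bytes
      else if ((p[0] & 0xF8) == 0xF0)
        len = 4;                    // 11110xxx + 3 continuation bytes
      else
        return 0;                   // stray continuation byte
      for (std::size_t i = 1; i != len; ++i)
        if ((p[i] & 0xC0) != 0x80)  // continuations are 10xxxxxx
          return 0;
      return len;
    }

A production check would also reject overlong encodings and surrogate code points, but this is the cheap first gate.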

AlisdairM

I'm looking at how to add additional literals once Unicode types are supported, which naturally takes me to the LexTokenInternal method of the Lexer class. The comment describing the function talks about returning true/false, yet the function is void so cannot return anything.

Fixed, thanks!

I suspect the API has evolved since the comment was written, but it makes me wonder about the validity of other parts of this comment, such as assumed null-termination of the buffer.

Everything else in that function's comment still holds. Buffers are still required to be nul terminated.
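
That guarantee is what lets the hot loops treat the terminating NUL as a sentinel instead of carrying an end pointer - roughly this shape (illustrative only, not the actual lexer code):

    // A NUL byte fails every test below, so the loop stops at the
    // end of the buffer without a separate bounds check per char.
    const char *skipIdentifierChars(const char *p) {
      while ((*p >= 'a' && *p <= 'z') || (*p >= 'A' && *p <= 'Z') ||
             (*p >= '0' && *p <= '9') || *p == '_')
        ++p;
      return p;
    }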

As this code is clearly marked as performance sensitive, I am being quite careful before proposing changes for multiple string/character literal types. The simple drop-through for 'L' will no longer work as we have 10(!) possible literal prefixes in C++0x:

  <empty>
  L
  R
  U
  u
  u8
  LR
  UR
  uR
  u8R

Also, the R variants only apply to string literals, not character literals.

Ok, eww :).

We must preserve enough info to continue parsing as an identifier if we do not find the ' or " character. In the odd case that we find a ' following an R, I believe we parse the chars up to and including the R as an identifier (maybe a macro) and start a fresh narrow-character literal token with the ' and no intervening whitespace.

Yuck :frowning:

This is probably after the first translation phase, so you also have to think about newlines and trigraphs, double yuck :slight_smile:
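
For example (going by the standard's phase ordering, where trigraph replacement and line splicing happen before tokens are formed), both of the following must end up lexing as a single u8R raw string literal:

    u8\
    R"(hello)"

    u8??/
    R"(hello)"

(??/ is the trigraph for backslash, so after phases 1 and 2 both collapse to u8R"(hello)".)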

I also notice that characters beyond the 7-bit ASCII range are deemed 'unknown' rather than potential identifiers. Ideally we should be checking for a valid UTF-8 sequence and allowing it through as an identifier in that case (consuming the full glyph). This seems a prerequisite to adding UCN support for identifiers, as UCNs will (presumably) map to the equivalent UTF-8 sequence. However, I am not sure how far investigation here should look into the GCC ABI for mangling identifiers with extended characters.

Yes, characters in the 128->255 range should definitely be considered potential identifiers at some point (when we support unicode better).

I would strongly recommend decomposing the problems you're working on into orthogonal pieces. Please attack unicode before (and independently) of raw strings, or raw strings independently of unicode. In LLVM and Clang, we strongly prefer incremental patches that get us going in the right direction over massive patches that implement an entire feature.

-Chris

From: Chris Lattner [mailto:clattner@apple.com]
Sent: 07 July 2009 06:10
To: AlisdairM (public)
Cc: 'clang-dev Developers'
Subject: Re: [cfe-dev] Confusing comment on LexTokenInternal

[points re-ordered for better reading]

I would strongly recommend decomposing the problems you're working on
into orthogonal pieces. Please attack unicode before (and
independently) of raw strings, or raw strings independently of
unicode. In LLVM and Clang, we strongly prefer incremental patches
that get us going in the right direction over massive patches that
implement an entire feature.

Yes, that is definitely the plan for submitting patches - I'm still trying to make sure I understand the likely solution space so I increment towards the right answer with minimal fuss. Once I know the end goal, I can pick the patch of least resistance to get there <g>

> As this code is clearly marked as performance sensitive, I am being
> quite careful before proposing changes for multiple string/character
> literal types. The simple drop-through for 'L' will no longer work
> as we have 10(!) possible literal prefixes in C++0x:
>
> <empty>
> L
> R
> U
> u
> u8
> LR
> UR
> uR
> u8R
>
> Also, the R variants only apply to string literals, not character
> literals.

Ok, eww :).

Oh, and it gets worse! I've not yet doubled this again to support user-defined string literals, which will also compound the number of character, floating point and integer literals we define. If I follow the existing scheme we will go from 2 string literal token types (tok::string_literal and tok::wide_string_literal) to 20!

Looking at how we might apply this in practice I see three 'dimensions' to reduce the combinatorial token explosion.

i/ rawness of a literal
ii/ user-defined or not
iii/ representation

(i) and (ii) have no impact on string literal concatenation, which is purely defined by (iii). (iii) will also apply to character literals.

(ii) will carry around an (optional) extra identifier which is the name/signature of the literal-id function to call. (ii) is also appropriate for character literals, floating point literals and integer literals.

(i) is problematic in that it has to be handled by the pre-processor lexer to identify the end of the string-literal part of the token. If we don't encode this knowledge at this point we will have to parse it out again later each time we use the literal.

There are also u8 string literals, which I propose to treat as a different 'representation' under (iii); this representation is only supported for string literals (i.e. not character literals). Although the storage (in Clang) is identical to the 'naked' literal, the rules for string concatenation are more restrictive.
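
To make that factoring concrete, the three dimensions could travel as one small descriptor instead of twenty token kinds - a hypothetical sketch with invented names and a deliberately simplified concatenation rule:

    // Hypothetical sketch, not a proposed Clang API.
    struct StringLiteralKind {
      enum Representation { Plain, Wide, Char16, Char32, UTF8 };
      Representation Repr;   // dimension (iii): drives concatenation
      bool IsRaw;            // dimension (i): raw vs cooked spelling
      bool IsUserDefined;    // dimension (ii): carries a ud-suffix
    };

    // Simplified rule: identical representations concatenate, and a
    // plain literal adopts its neighbour's representation. The
    // draft's exact rules (e.g. for u8) are stricter than this.
    bool canConcatenate(StringLiteralKind a, StringLiteralKind b) {
      return a.Repr == b.Repr ||
             a.Repr == StringLiteralKind::Plain ||
             b.Repr == StringLiteralKind::Plain;
    }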

I *think* that is the full set of issues that hit parsing string literals for C++0x, hopefully I have missed nothing now.

So what does this mean in practice?

I want to kill tok::wide_string_literal and somehow stuff the encoding into tok::string_literal (char, char16_t, char32_t, wchar_t, or the u8 special case; options for other languages may be appropriate too). Any advice on how to approach this is appreciated.

Second, I want to include the start/end range of the contents of a raw string literal - minus the delimiters. Again, this must be done by the lexer, so I suggest stuffing the information somewhere into the token. This suggests separate tokens for raw and non-raw (cooked?!) string literals.

Finally, user-defined literals for character, string, integer and floating point literals need somewhere to stuff their ud-suffix.

The 'obvious' idea that presents itself to me is to store this information on the other side of the Token's PtrData pointer, although I'm not sure about the object lifetime management issues that would follow. Identifiers seem to do well stuffing an IdentifierInfo in there, although identifiers traditionally hang around much longer than literals, which tend to have a single use and then retire.
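
One lifetime-safe shape for this - entirely hypothetical names, not the existing Token API - is to allocate the extra data from storage owned by some long-lived object such as the preprocessor, so the pointer in the token stays valid however briefly the token itself lives:

    #include <memory>
    #include <vector>

    // Hypothetical side data a literal token could point at.
    struct LiteralExtraData {
      const char *ContentBegin;  // contents, delimiters excluded
      const char *ContentEnd;
      const char *UDSuffix;      // null when not a user-defined literal
    };

    // Arena owned by a long-lived object, so tokens need no
    // destructor: everything is freed when the arena dies.
    class LiteralDataArena {
      std::vector<std::unique_ptr<LiteralExtraData>> Storage;
    public:
      LiteralExtraData *create() {
        Storage.push_back(std::unique_ptr<LiteralExtraData>(
            new LiteralExtraData()));
        return Storage.back().get();  // address stable until arena dies
      }
    };

The token would then carry only the LiteralExtraData*, sidestepping per-token ownership entirely.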

AlisdairM

Please attack unicode before (and independently) of raw strings, or
raw strings independently of unicode. In LLVM and Clang, we strongly
prefer incremental patches that get us going in the right direction
over massive patches that implement an entire feature.

Yes, that is definitely the plan for submitting patches - I'm still trying to make sure I understand the likely solution space so I increment towards the right answer with minimal fuss. Once I know the end goal, I can pick the patch of least resistance to get there <g>

Great!

As this code is clearly marked as performance sensitive, I am being
quite careful before proposing changes for multiple string/character
literal types. The simple drop-through for 'L' will no longer work
as we have 10(!) possible literal prefixes in C++0x:

  <empty>
  L
  R
  U
  u
  u8
  LR
  UR
  uR
  u8R

Also, the R variants only apply to string literals, not character
literals.

Ok, eww :).

Oh, and it gets worse! I've not yet doubled this again to support user-defined string literals, which will also compound the number of character, floating point and integer literals we define. If I follow the existing scheme we will go from 2 string literal token types (tok::string_literal and tok::wide_string_literal) to 20!

Ok, if this is the case, it is probably better to go from two token types to one (just string_literal) and have the literal parser stuff actually do the categorization. I think the interesting clients are all using the literal parser anyway.

So what does this mean in practice?

I want to kill tok::wide_string_literal and somehow stuff the encoding into tok::string_literal (char, char16_t, char32_t, wchar_t, or the u8 special case; options for other languages may be appropriate too). Any advice on how to approach this is appreciated.

Makes sense to me! Do you actually need to encode this in the *Token*? Could you just have StringLiteralParser determine these properties?

Second, I want to include the start/end range of the contents of a raw string literal - minus the delimiters. Again, this must be done by the lexer, so I suggest stuffing the information somewhere into the token. This suggests separate tokens for raw and non-raw (cooked?!) string literals.

I think that the main lexer should just get the full extent of the token and store it in the token (as it does with all other tokens). The string literal parser would then "relex" the contents when needed to remove the delimiters etc.
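
A sketch of what that relex amounts to for the raw case - helper name invented, and it assumes the lexer has already validated the token:

    #include <string>

    // Given the full spelling of a validated raw string literal,
    // e.g. R"abc(hello)abc", recover the contents between the
    // delimiters. The rescan is bounded: the d-char delimiter is
    // at most 16 characters.
    std::string rawStringContents(const std::string &spelling) {
      std::size_t quote = spelling.find('"');  // skip R, u8R, LR, ...
      std::size_t open  = spelling.find('(', quote);
      std::string delim = spelling.substr(quote + 1, open - quote - 1);
      std::string close = ")" + delim + "\"";
      std::size_t end   = spelling.rfind(close);
      return spelling.substr(open + 1, end - (open + 1));
    }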

-Chris

From: Chris Lattner [mailto:clattner@apple.com]
Sent: 07 July 2009 18:25
To: AlisdairM (public)
Cc: 'clang-dev Developers'
Subject: Re: [cfe-dev] Confusing comment on LexTokenInternal

> I want to kill tok::wide_string_literal and somehow stuff the
> encoding into tok::string_literal (char, char16_t, char32_t, wchar_t,
> or the u8 special case; options for other languages may be
> appropriate too). Any advice on how to approach this is appreciated.

Makes sense to me! Do you actually need to encode this in the
*Token*? Could you just have StringLiteralParser determine these
properties?

I guess that makes sense. The lexer doesn't want to attribute any meaning to the prefix/suffix on the literal, merely find meaningful bounds. String literal concatenation happens in the parser, and this is probably the first time we really care about representation.

So I guess the first step is to kill tok::wide_string_literal, kill the Boolean flag to Lexer::LexStringLiteral, and carry the prefix (in the token's SourceRange) through the APIs into StringLiteralParser.

Then perform a matching change for character literals.

Then we can look into adding Unicode character types, or recognising suffixes for user-defined literals as part of the same token. Note: a user-defined literal is effectively a disguised function call syntax, although they might become more 'literal-like' when someone (not me!) gets around to implementing constexpr.
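
For concreteness, a string ud-suffix ultimately selects a literal operator, so a use is spelt like a literal but behaves as a call (sketch using the feature as it was eventually standardised; _s is a made-up suffix here):

    #include <cstddef>
    #include <string>

    // "hello"_s below is effectively operator"" _s("hello", 5).
    std::string operator"" _s(const char *str, std::size_t len) {
      return std::string(str, len);
    }

    std::string greeting = "hello"_s;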

> Second, I want to include the start/end range of the contents of a
> raw string literal - minus the delimiters. Again, this must be done
> by the lexer, so I suggest stuffing the information somewhere into the
> token. This suggests separate tokens for raw and non-raw (cooked?!)
> string literals.

I think that the main lexer should just get the full extent of the
token and store it in the token (as it does with all other tokens).
The string literal parser would then "relex" the contents when needed
to remove the delimiters etc.

This means duplicating some work, but probably not too much as the delimiters are limited to 16 chars max. The premature optimiser in me wants to do the work (and maintain it!) once and no more, but I'm not about to fight the data structures to make it happen - that is rarely a good sign.

If that sounds right, then it is time for me to stop talking and start cleaning up some patches ;¬)

AlisdairM

I want to kill tok::wide_string_literal and somehow stuff the
encoding into tok::string_literal (char, char16_t, char32_t, wchar_t,
or the u8 special case; options for other languages may be
appropriate too). Any advice on how to approach this is appreciated.

Makes sense to me! Do you actually need to encode this in the
*Token*? Could you just have StringLiteralParser determine these
properties?

I guess that makes sense. The lexer doesn't want to attribute any meaning to the prefix/suffix on the literal, merely find meaningful bounds. String literal concatenation happens in the parser, and this is probably the first time we really care about representation.

Right. It also uses the Literal stuff to parse it.

So I guess the first step is to kill tok::wide_string_literal, kill the Boolean flag to Lexer::LexStringLiteral, and carry the prefix (in the token's SourceRange) through the APIs into StringLiteralParser.

Yep.

Then perform a matching change for character literals.

Yep.

Then we can look into adding Unicode character types, or recognising suffixes for user-defined literals as part of the same token. Note: a user-defined literal is effectively a disguised function call syntax, although they might become more 'literal-like' when someone (not me!) gets around to implementing constexpr.

sounds great!

I think that the main lexer should just get the full extent of the
token and store it in the token (as it does with all other tokens).
The string literal parser would then "relex" the contents when needed
to remove the delimiters etc.

This means duplicating some work, but probably not too much as the delimiters are limited to 16 chars max. The premature optimiser in me wants to do the work (and maintain it!) once and no more, but I'm not about to fight the data structures to make it happen - that is rarely a good sign.

If that sounds right, then it is time for me to stop talking and start cleaning up some patches ;¬)

Sounds great, thanks!

-Chris

OK, tried it, and time to scratch that plan already!
The problem is not that string_literal cannot handle the wide_string_literal cases; that was easy to fix up. However, there are a few places in the grammar that require a string literal to be exactly that - a narrow string literal. Examples are #include "myfile" and extern "C".

Now I could try and stuff a flag into the token to indicate it truly is a narrow string literal - but we already have that effect with the two separate tokens. That seems to be working and is quite well tested by now so I think we should keep this in place.

The new plan is to repurpose wide_string_literal to cover any annotated string literal, i.e. one with any prefix or suffix. 'Annotated' seems to have other connotations though, so I'm looking for a better term. In the meantime I'll put the foundation in for wide_string_literal to handle the 19 other cases that string_literal does not.

AlisdairM

Sounds like a good compromise to me! How about string_literal and complex_string_literal or something like that?

-Chris

AlisdairM(public) wrote:-

The problem is not that string_literal cannot handle the wide_string_literal cases; that was easy to fix up. However, there are a few places in the grammar that require a string literal to be exactly that - a narrow string literal. Examples are #include "myfile" and extern "C".

That doesn't seem like a reason to scrap the plan. All such places
need the spelling as they're going to do something with it, such
as check it is "C" or "C++" in the extern case. Rather than check
the token type to be a narrow string, they simply need to check
spelling[0] is a double quote.
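
In sketch form (the spelling-fetching plumbing elided):

    #include <string>

    // A narrow string literal is exactly one whose spelling starts
    // with the quote itself, i.e. it has no prefix or suffix.
    bool isNarrowStringLiteral(const std::string &spelling) {
      return !spelling.empty() && spelling[0] == '"';
    }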

Neil.

The difference is that code is already in place and works, acting purely off the existing token definitions. If I decide to merge, it is not enough to remove all references to wide_string_literal (easy); I also need to make sure I do not miss any references to string_literal that do NOT have additional logic for wide_string_literal, and add the additional logic to test for a prefix *or suffix*. The risk of introducing bugs into already working, tested code is too high for me.

(yes, your suggestion was my initial reaction too. I simply did not like what my code audit turned up)

AlisdairM