From: Chris Lattner [mailto:firstname.lastname@example.org]
Sent: 07 July 2009 06:10
To: AlisdairM (public)
Cc: 'clang-dev Developers'
Subject: Re: [cfe-dev] Confusing comment on LexTokenInternal
[points re-ordered for better reading]
I would strongly recommend decomposing the problems you're working on
into orthogonal pieces. Please attack unicode before (and
independently) of raw strings, or raw strings independently of
unicode. In LLVM and Clang, we strongly prefer incremental patches
that get us going in the right direction over massive patches that
implement an entire feature.
Yes, that is definitely the plan for submitting patches - I'm still trying to make sure I understand the likely solution space so that I increment towards the right answer with minimal fuss. Once I know the end goal, I can pick the patch of least resistance to get there <g>
> As this code is clearly marked as performance sensitive, I am being
> quite careful before proposing changes for multiple string/character
> literal types. The simple drop-through for 'L' will no longer work
> as we have 10(!) possible literal prefixes in C++0x:
> Also, the R variants only apply to string literals, not character
Ok, eww :).
Oh, and it gets worse! I've not doubled this again to support user-defined string literals, which will also compound the number of character, floating point and integer literals we define. If I follow the existing scheme we will go from 2 string literal token types (tok::string_literal and tok::wide_string_literal) to 20!
Looking at how we might apply this in practice I see three 'dimensions' to reduce the combinatorial token explosion.
i/ rawness of a literal
ii/ user-defined or not
iii/ representation (character type / encoding) of the literal
(i) and (ii) have no impact on string literal concatenation, which is purely defined by (iii). (iii) will also apply to character literals.
(ii) will carry around an (optional) extra identifier which is the name/signature of the literal-id function to call. (ii) is also appropriate for character literals, floating point literals and integer literals.
(i) is problematic in that it has to be handled by the pre-processor lexer to identify the end of the string-literal part of the token. If we don't encode this knowledge at this point we will have to parse it out again later each time we use the literal.
There is also the matter of u8 string literals, which I propose to treat as a different 'representation' in (iii), and which is only supported for string literals (i.e. not character literals). Although the storage (in Clang) is identical to that of the 'naked' literal, the rules for string concatenation are more restrictive.
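To make the three dimensions concrete, here is a rough sketch of what a single string literal description might carry instead of 20 token kinds. Every name here (LiteralEncoding, StringLiteralInfo, concatEncoding) is made up for illustration and is not existing Clang API. The concatenation rule is deliberately simplified and is my assumption, not the standard wording: an unprefixed literal adopts the other operand's encoding, and any other mismatch is rejected.

```cpp
#include <cassert>
#include <string>

// Hypothetical names throughout; none of these exist in Clang.
enum class LiteralEncoding { Plain, UTF8, Char16, Char32, Wide };

struct StringLiteralInfo {
  bool IsRaw;               // dimension (i): raw or cooked
  std::string UDSuffix;     // dimension (ii): empty if not user-defined
  LiteralEncoding Encoding; // dimension (iii): drives concatenation
};

// Simplified concatenation rule (an assumption, not the standard text).
// Only the encoding matters; rawness and ud-suffix are ignored here.
// Returns false if the two encodings cannot be concatenated.
bool concatEncoding(LiteralEncoding A, LiteralEncoding B,
                    LiteralEncoding &Result) {
  if (A == B) {
    Result = A;
    return true;
  }
  if (A == LiteralEncoding::Plain) {
    Result = B; // unprefixed literal adopts the other encoding
    return true;
  }
  if (B == LiteralEncoding::Plain) {
    Result = A;
    return true;
  }
  return false; // e.g. u8 with wide
}

// Usage sketch: plain + wide is accepted, u8 + wide is rejected.
inline void concatExample() {
  LiteralEncoding Out;
  assert(concatEncoding(LiteralEncoding::Plain, LiteralEncoding::Wide, Out));
  assert(!concatEncoding(LiteralEncoding::UTF8, LiteralEncoding::Wide, Out));
}
```

Under this rule the u8 case is more restrictive than the naked one even though both would be stored as char data, which is why I would keep it as a separate encoding value.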
I *think* that is the full set of issues that hit parsing string literals for C++0x, hopefully I have missed nothing now.
So what does this mean in practice?
I want to kill tok::wide_string_literal and somehow stuff the encoding into tok::string_literal (char, char16_t, char32_t, wchar_t, or the u8 special case; options for other languages may be appropriate too). Any advice on how to approach this would be appreciated.
Second, I want to include the start/end range of the contents of a raw string literal - minus the delimiters. Again, this must be done by the lexer, so I suggest stuffing the information somewhere into the token. This suggests separate tokens for raw and non-raw (cooked?!) string literals.
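As a rough illustration of the range I mean, the sketch below finds the contents of a raw string literal and returns them as a half-open offset range. It assumes the R"delim( ... )delim" spelling with parentheses; the bracket character in the draft may differ, so treat that as an assumption. RawContentRange and findRawContents are hypothetical names, and the delimiter itself is not validated (length or permitted characters).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type: [Begin, End) covers the contents only,
// excluding the R" prefix, the delimiter, and the closing quote.
struct RawContentRange {
  bool Valid;
  std::size_t Begin;
  std::size_t End;
};

// Text[Start] is expected to be the 'R' of R"delim( ... )delim".
// Any encoding prefix (u8, u, U, L) must already have been skipped.
RawContentRange findRawContents(const std::string &Text,
                                std::size_t Start = 0) {
  const RawContentRange Bad = {false, 0, 0};
  if (Start + 2 > Text.size() || Text.compare(Start, 2, "R\"") != 0)
    return Bad;

  const std::size_t DelimBegin = Start + 2;
  const std::size_t Open = Text.find('(', DelimBegin);
  if (Open == std::string::npos)
    return Bad;

  // The literal ends at the first occurrence of ')' + delimiter + '"'.
  const std::string Close =
      ")" + Text.substr(DelimBegin, Open - DelimBegin) + "\"";
  const std::size_t End = Text.find(Close, Open + 1);
  if (End == std::string::npos)
    return Bad;

  const RawContentRange Result = {true, Open + 1, End};
  return Result;
}

// Usage sketch: a ')' followed by '"' inside the contents does not end
// the literal, because the delimiter "xy" does not sit between them.
inline void rawExample() {
  const std::string Text = "R\"xy(a)\"b)xy\"";
  const RawContentRange R = findRawContents(Text);
  assert(R.Valid && Text.substr(R.Begin, R.End - R.Begin) == "a)\"b");
}
```

Doing this once in the lexer and keeping Begin/End with the token is what would save re-parsing the delimiters on every later use.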
Finally, user-defined literals for character, string, integer and floating point literals need somewhere to stuff their ud-suffix.
The 'obvious' idea that presents itself to me is to store this information on the other side of the Token's PtrData pointer, although I'm not sure about the object lifetime management issues that would follow. Identifiers seem to do well stuffing an IdentifierInfo into there, although identifiers traditionally hang around much longer than literals, which tend to have a single use and then retire.
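One possible answer to the lifetime question, sketched below purely as a strawman: let a pool own the side data and have PtrData point into it, so everything is released together when the pool goes away. Token here is a stand-in with only the two fields I need, not the real clang::Token layout, and LiteralExtra / LiteralExtraPool are hypothetical names. I use std::deque only because push_back on it leaves pointers to existing elements valid; a real implementation would presumably use whatever allocator the preprocessor already has.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical side data for a literal token.
struct LiteralExtra {
  std::string UDSuffix; // empty if the literal is not user-defined
};

// Stand-in for clang::Token; NOT its real layout.
struct Token {
  int Kind;
  void *PtrData;
};

// Owns every LiteralExtra; all of them die together with the pool.
class LiteralExtraPool {
  // deque::push_back does not invalidate pointers to existing elements.
  std::deque<LiteralExtra> Storage;

public:
  LiteralExtra *create(const std::string &Suffix) {
    Storage.push_back(LiteralExtra{Suffix});
    return &Storage.back();
  }
};

// Reads the side data back out of a token.
inline const LiteralExtra *getExtra(const Token &Tok) {
  return static_cast<const LiteralExtra *>(Tok.PtrData);
}

// Usage sketch: the pointer stays valid while more literals are lexed.
inline void poolExample() {
  LiteralExtraPool Pool;
  Token Tok = {0, Pool.create("_km")};
  for (int I = 0; I < 1000; ++I)
    Pool.create("_other");
  assert(getExtra(Tok)->UDSuffix == "_km");
}
```

The obvious cost is that single-use literals keep their side data alive for as long as the pool lives, which is exactly the trade-off I am unsure about.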