Still trying to wrap my head around the C++0x explosion of literals.
I'm thinking of taking yet another step back from the complex_string_literal idea, and storing the ud-suffix of a user-defined literal in its own token. That way, the same mechanism can work for user-defined string literals, character literals, integer literals and floating-point literals. The ud-suffix has to be an identifier, and we already have logic for that.
My inexperienced question is - when we move from lex to parse, is it easy to identify there is no whitespace between the literal component and the ud-suffix, or should I be trying to take care of this in the lexing phase?
User-defined suffixes aside, the remaining work for 9 of the 10 possible string literals seems to fall naturally to a single prefixed_string_literal token.
AlisdairM
You can check in the parser with hasLeadingSpace(), but it's probably
better to handle it in the lexer; it would massively complicate the
parser to check for a string in every single place that expects an
identifier.
-Eli
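To make the "handle it in the lexer" option concrete, here is a minimal toy sketch. It is not Clang's lexer: the `Kind` enum, the `Token` struct and the `lex` function are hypothetical names invented for this illustration, and it only handles unprefixed string literals without escapes. The point is that the lexer is the one place where adjacency is trivially visible, so an identifier is recorded as a ud-suffix only when it starts at the character right after the closing quote.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch; not Clang's tok:: enumeration.
enum class Kind { StringLiteral, Identifier, Other };

struct Token {
    Kind kind;
    std::string text;      // literal body without quotes, or identifier spelling
    std::string udSuffix;  // empty when a literal has no ud-suffix
};

static bool isIdentStart(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) || c == '_';
}
static bool isIdentBody(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Toy lexer: unprefixed string literals (no escapes), identifiers, and
// single-character "other" tokens. Whitespace separates tokens.
std::vector<Token> lex(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
        if (c == '"') {
            std::size_t end = src.find('"', i + 1);
            if (end == std::string::npos) end = src.size();  // unterminated: take the rest
            Token t{Kind::StringLiteral, src.substr(i + 1, end - i - 1), ""};
            i = (end < src.size()) ? end + 1 : end;
            // An identifier that starts immediately after the closing quote
            // becomes part of the same token as its ud-suffix.
            if (i < src.size() && isIdentStart(src[i])) {
                std::size_t s = i;
                while (i < src.size() && isIdentBody(src[i])) ++i;
                t.udSuffix = src.substr(s, i - s);
            }
            out.push_back(t);
        } else if (isIdentStart(c)) {
            std::size_t s = i;
            while (i < src.size() && isIdentBody(src[i])) ++i;
            out.push_back({Kind::Identifier, src.substr(s, i - s), ""});
        } else {
            out.push_back({Kind::Other, std::string(1, c), ""});
            ++i;
        }
    }
    return out;
}

// Small self-check: adjacency decides whether X is a suffix or its own token.
void lexDemo() {
    assert(lex("\"foo\"X").size() == 1);
    assert(lex("\"foo\" X").size() == 2);
}
```

With this shape the parser never needs to ask about whitespace: `"foo"X` arrives as one literal token carrying suffix `X`, while `"foo" X` arrives as a plain literal followed by an ordinary identifier.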
[Sorry Eli - I will get the hang of 'reply-all' soon...]
It probably doesn't matter too much whether the tokens are separate; I
just think it needs to be resolved in the lexer. It sounded like you
were intending to reuse tok::identifier for these suffixes, and that
sounds like a bad idea.
-Eli
Oh, sorry. No, I intended to use tok::identifier as the model to learn from. This is still early days in the compiler-writing business for me ;¬) I learn best by adapting existing examples, and am trying to feel my way into the compiler with 'low-touch' features. The more I can do with literals inside lex without ever touching parse and sema the better! (for now)
AlisdairM
>> My inexperienced question is - when we move from lex to parse, is it
>> easy to identify there is no whitespace between the literal component
>> and the ud-suffix, or should I be trying to take care of this in the
>> lexing phase?
>
> You can check in the parser with hasLeadingSpace(), but it's probably
> better to handle it in the lexer; it would massively complicate the
> parser to check for a string in every single place that expects an
> identifier.

Why would that be necessary? I think only ParseStringLiteral would have to
check if the string literal is immediately followed by an identifier
without space in-between.
The check for strings where an identifier is expected is only necessary if
you want to use the same technique for prefixes.
Sebastian
This is a really dangerous thing to do. I'd strongly recommend returning these as one token in the lexer. If not, you end up with weirdness like this:
#define X ud
"foo"X
(or whatever the syntax is). Splitting them up will also break token pasting and a number of other things in the preprocessor.
-Chris
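The macro hazard above can be observed in C++11 as it was eventually standardized, where a string literal with an adjacent ud-suffix is a single preprocessing token. The sketch below is an illustration under assumptions: the suffixes `_x` and `_ud` and the tagging behaviour are made up for the example, and defining a macro named `_x` is done only to provoke the collision.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Two hypothetical literal operators that tag their result, so the caller
// can tell which one was selected.
std::string operator""_x(const char* s, std::size_t n) {
    return "x:" + std::string(s, n);
}
std::string operator""_ud(const char* s, std::size_t n) {
    return "ud:" + std::string(s, n);
}

// Defined after the operators so their declarations are not affected.
// Illustration only: a real program should not define such a macro.
#define _x _ud

// "foo"_x is lexed as ONE token, so the macro _x is never seen as an
// identifier here and is not expanded; operator""_x is called.
std::string adjacent() { return "foo"_x; }

// By contrast, writing "foo" _x (with a space) would expand to "foo" _ud,
// a string literal followed by a stray identifier, which does not compile.

void macroDemo() { assert(adjacent() == "x:foo"); }
```

Had the lexer produced two tokens for `"foo"_x`, the second one would be an ordinary identifier subject to macro replacement, and the call would silently resolve to a different suffix.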
Yes, that's the syntax and a good point. I think the clincher here for me though could be string concatenation:
auto x = "foo"X L"bar" "!"X;
In this case, we do regular string concatenation for the strings, ignoring the ud-suffixes. If that is well-formed, we proceed to check that all the ud-suffixes, where present, are the same. If that is true, then we call:
operator "" X(const wchar_t *, size_t) with our wide string (in this case).
To help me understand where I am going with this though, could you explain how we would lex and tokenise:
call("foo" L"bar" "!");
Here we concatenate 3 string literals and pass the resulting literal to the function 'call'. That is analogous to applying a user-defined literal so I would like the end result (by the time we get to AST) to be as similar as possible.
AlisdairM
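The concatenation rule described in this message can be tried out with C++11 as it was later standardized. The following is a sketch under assumptions: the suffix is spelled `_x` rather than `X` (suffixes without a leading underscore are reserved), and the two overloads simply return the text they receive so that the chosen overload and the concatenated contents are observable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical literal operators for the suffix _x: one for wide strings,
// one for narrow strings. Each returns the characters it was handed.
std::wstring operator""_x(const wchar_t* s, std::size_t n) {
    return std::wstring(s, n);
}
std::string operator""_x(const char* s, std::size_t n) {
    return std::string(s, n);
}

// Narrow and wide pieces concatenate to one wide literal; the suffixes that
// are present agree, so the wchar_t overload receives the whole string.
std::wstring wide() {
    auto x = "foo"_x L"bar" "!"_x;
    return x;
}

// All-narrow pieces with the suffix on only one of them: the concatenated
// literal still carries the suffix and goes to the char overload.
std::string narrow() {
    auto y = "foo"_x "bar" "!";
    return y;
}

void concatDemo() {
    assert(wide() == L"foobar!");
    assert(narrow() == "foobar!");
}
```

In both cases the operator is applied once, to the result of the concatenation, which is the same shape as `call("foo" L"bar" "!")` receiving one already-concatenated literal.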