Almost there...

Thanks - that seemed to do the trick <g>
I had downloaded Clang before LLVM so it was not nested in the LLVM tree.

Now my last problem - I am getting a couple of compiler errors trying to
build the CodeGen library, so can't link the final clang-cc. Are these a
known issue?

...\clang\lib\CodeGen\CGBuiltin.cpp(271) : error C2039: 'BIalloca' : is not
a member of 'clang::Builtin'
...\clang\lib\CodeGen\CGBuiltin.cpp(271) : error C2065: 'BIalloca' :
undeclared identifier
...\clang\lib\CodeGen\CGBuiltin.cpp(271) : error C2051: case expression not
constant

...\clang\lib\CodeGen\CGCall.cpp(1757) : error C2039: 'NoRedZone' : is not a
member of 'llvm::Attribute'
...\clang\lib\CodeGen\CGCall.cpp(1757) : error C2065: 'NoRedZone' :
undeclared identifier

Otherwise, I look ready to go.

AlisdairM

AlisdairM(public) wrote:

Now my last problem - I am getting a couple of compiler errors trying to
build the CodeGen library, so can't link the final clang-cc. Are these a
known issue?

...\clang\lib\CodeGen\CGBuiltin.cpp(271) : error C2039: 'BIalloca' : is not
a member of 'clang::Builtin'
...\clang\lib\CodeGen\CGBuiltin.cpp(271) : error C2065: 'BIalloca' :
undeclared identifier
...\clang\lib\CodeGen\CGBuiltin.cpp(271) : error C2051: case expression not
constant

...\clang\lib\CodeGen\CGCall.cpp(1757) : error C2039: 'NoRedZone' : is not a
member of 'llvm::Attribute'
...\clang\lib\CodeGen\CGCall.cpp(1757) : error C2065: 'NoRedZone' :
undeclared identifier
  

Looks like you caught a broken in-between commit. Try simply updating
Clang again.
If that's not it, I'll have to look at it myself. Been a while since I
built Clang on Windows.

Sebastian

Actually, you'll need to update LLVM. They tend to be developed together (NoRedZone was added to both the front end and the back end on the same day), but you need separate "svn update" commands for the LLVM and Clang trees.
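
Assuming the usual nested checkout, with Clang living under llvm/tools/clang, that means updating both working copies, e.g.:

  cd llvm
  svn update
  cd tools/clang
  svn update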

  - Doug

Thanks both.
The NoRedZone problem has cleared up with an LLVM update, as Doug suggested.
The Builtin::BIalloca issue remains, but it is a simple fall-through case
label in a switch statement, so I'm commenting it out for now and the rest
seems fine. The name looks odd anyway - it does not match the usual naming
pattern in use here.
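
For the record, the workaround is just this kind of edit in CGBuiltin.cpp
(the neighbouring case label is from memory, so treat it as approximate):

  case Builtin::BI__builtin_alloca:
  // case Builtin::BIalloca:    // commented out locally; not declared in this checkout
    ...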

Not sure why this isn't tripping everyone else up though - it does not look
Windows-specific to me.

To warm up I'm going to look into implementing C++0x raw string literals, as
I expect this to be a very localised change. I think it will be easier to
do this before looking into the Unicode char types and updating the string
concatenation rules.

I expect most of this to fall inside the StringLiteralParser class, but
could someone point me to what starts the scanning for a string literal,
i.e. the code that recognizes " or L" in order to invoke the string-literal
parser?
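
I imagine it is a dispatch of roughly this shape somewhere in the lexer's
main token switch (the names below are my guesses, not checked against the
current tree):

  case '"':
    return LexStringLiteral(Result, CurPtr, /*Wide=*/false);
  case 'L':
    if (getCharAndSize(CurPtr, SizeTmp) == '"')   // L"..."
      return LexStringLiteral(Result, ConsumeChar(CurPtr, SizeTmp, Result),
                              /*Wide=*/true);
    // otherwise carry on and lex an identifier starting with 'L'
    ...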

Thanks muchly,
AlisdairM

Hi,


To warm up I'm going to look into implementing C++0x raw string literals, as
I expect this to be a very localised change. I think it will be easier to
do this before looking into the Unicode char types and updating the string
concatenation rules.
  
I implemented raw strings a long time ago, but never got around to cleaning them up enough to commit. See:

http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-June/002054.html

and the following mails. (the latest patch should be attached to http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-June/002085.html)

There is probably some bit rot, but the lexer didn't change that much, so it may be a good starting point (at least there are some comments from Chris and Eli). I also attached two small test cases.

regards,

Cédric

rawstring.cpp (283 Bytes)

test2.cpp (1.12 KB)

Thanks - that looks really helpful as a starting point.

I'm thinking about char16_t/char32_t and the inevitable string concatenation
rules as I do this, so it might take a little longer than I thought. Not sure
whether it makes sense to tackle the character types or the raw literals
first - although I definitely think they should be supplied incrementally as
two distinct tasks!

Reason I am thinking about this now is what does a d-char mean for a
char32_t string? Assuming we can read a UTF32 formatted source file, those
UTF32 glyphs used to denote the delimiter will be transcoded down to
universal-character-names in the basic source character set, taking up to 10
characters each. That means what is a single-character delimiter from the
user's perspective may reach the raw-literal parser as ten d-chars.

Hmmm, that seems to be what is required today, so that is what I'll
implement, although I'll probably file an issue with WG21 to see if this is
really what is intended.
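
To put rough numbers on it (purely illustrative):

  delimiter as the user types it:        one glyph, e.g. U+1F600
  after phase-1 mapping to a UCN:        \U0001F600  (10 characters)
  d-char-sequence limit, as I read it:   16 characters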

AlisdairM

I think you just happened to catch a bad revision; try updating clang
to the current revision.

-Eli

AlisdairM(public) wrote:-

Reason I am thinking about this now is what does a d-char mean for a
char32_t string? Assuming we can read a UTF32 formatted source file, those

You should be able to assume the basic character set is single byte;
both C and C++ require this. So no UTF32 source files.

Something else to think about: how you track source locations if you
iconv the whole file upfront.

Neil.

AlisdairM(public) wrote:-

Reason I am thinking about this now is what does a d-char mean for a
char32_t string? Assuming we can read a UTF32 formatted source file, those

You should be able to assume the basic character set is single byte;
both C and C++ require this. So no UTF32 source files.

I don't see any connection between the basic character set and the
encoding of the source file.

Something else to think about: how you track source locations if you
iconv the whole file upfront.

Source locations ought to just point into the converted buffer, I
think; we don't need to know the byte offsets in the original file.

-Eli

Eli Friedman wrote:-

>> Reason I am thinking about this now is what does a d-char mean for a
>> char32_t string? Assuming we can read a UTF32 formatted source file, those
>
> You should be able to assume the basic character set is single byte;
> both C and C++ require this. So no UTF32 source files.

I don't see any connection between the basic character set and the
encoding of the source file.

The source character set is generally understood to be the character
set the user interacts with via their terminal, editor, etc.

http://www.dinkumware.com/manuals/?manual=compleat&page=charset.html

Each member of the basic character set is required to be represented
as a single byte in the source character set.

> Something else to think about: how you track source locations if you
> iconv the whole file upfront.

Source locations ought to just point into the converted buffer, I
think; we don't need to know the byte offsets in the original file.

If you're going to quote the source then you'll need to convert
back again - someone using an ISO-8859 terminal or Japanese terminal
won't want mangled UTF-8 diagnostics. Charset conversion is not
reversible in general; whether that's a practical issue is not
clear.

Apple's "interesting" decision to encode their headers in neither
ASCII nor UTF-8 will have implications too.

Neil.

The source character set is irrelevant, at least going by the C++ standard.
The very first phase of translation (C++ 2.2p1, bullet point 1) specifies
an implementation-defined mapping of physical source file characters to the
basic source character set. Making that mapping a UTF-32-to-UTF-8 conversion
is perfectly valid.
Interestingly enough, the standard says that any character not in the basic
set must be encoded as a UCN. That sounds impractical, so I guess since we
want to use UTF-8 internally anyway, we should make use of the as-if rule
and instead represent everything, including UCNs in the original source, in
its real UTF-8 encoding.
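
As a concrete (illustrative) example of the difference:

  U+00E9 spelled as a UCN in the source:  \u00e9      (6 characters)
  U+00E9 held internally as UTF-8:        0xC3 0xA9   (2 bytes)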

Sebastian

> Something else to think about: how you track source locations if you
> iconv the whole file upfront.

Source locations ought to just point into the converted buffer, I
think; we don't need to know the byte offsets in the original file.

If you're going to quote the source then you'll need to convert
back again - someone using an ISO-8859 terminal or Japanese terminal
won't want mangled UTF-8 diagnostics.

We need to do this anyway: our localized Japanese diagnostics
(assuming we get some at some point) will most likely be stored in
UTF-8. Also, the terminal doesn't necessarily use the same charset as
the source file; the source charset can be overridden with
-finput-charset (or at least, that's the intention).

Charset conversion is not
reversible in general; whether that's a practical issue is not
clear.

I don't think that's an issue in practice; any reasonable charset can
be mapped to Unicode.

Apple's "interesting" decision to encode their headers in neither
ASCII nor UTF-8 will have implications too.

We probably won't bother to warn for invalid UTF-8 sequences in
comments; are there other places where there are issues?

-Eli

[Note: subject line changed]

Sebastian Redl wrote:

The source character set is irrelevant, at least going by the C++ standard.
The very first phase of translation (C++ 2.2p1, bullet point 1) specifies
an implementation-defined mapping of physical source file characters to the
basic source character set. Making that mapping a UTF-32-to-UTF-8 conversion
is perfectly valid.
Interestingly enough, the standard says that any character not in the basic
set must be encoded as a UCN. That sounds impractical, so I guess since we
want to use UTF-8 internally anyway, we should make use of the as-if rule
and instead represent everything, including UCNs in the original source, in
its real UTF-8 encoding.

I'm putting together an HTML document that will hopefully describe current
Clang assumptions and handling of source code and encodings, together with a
set of proposals to go forward with UCNs, Unicode string literals, raw
string literals, and source files in encodings other than UTF-8. This will
be very biased towards the C++ standard requirements, although if you can
point me to a specification for Objective-C I will take that on board. I
believe the C rules are very similar to C++, as the standards try to stay in
sync for these low-level details, although I will double-check for corner
cases.

For reference, in a former life I was PM for a C++ IDE and was very
surprised at the demand from Japan for UTF-16 support in source files. The
assumption that UTF-8 would be adequate did not hold. There seemed little
demand for support for UTF-32 encoding though.

This is clearly more work than I thought I was getting into for a first
project, but if it's worth doing then it is worth doing right - and there
are a number of features bound together here that I really want a plan for
delivering as a set, even if the implementation is incremental.

I'm also trying to pull together a few more papers for the next C++
committee mailing, due in two weeks, and I guess this will be ready shortly
after that.

Issues I need to investigate right now are how/if we handle UCNs. The
impact is that a UCN will most probably take fewer characters in its string
literal representation than in the source itself, and we certainly can't
assume a 1-1 mapping of source locations to string literal representations.
Diagnostics probably will want both representations, so users get a chance
to see if their UCN character matches the glyphs they expected, while still
getting an accurate representation of the source. Likewise, we must handle
UCNs in identifiers, with similar issues when reporting diagnostics. My
initial inclination for identifiers is that displaying the UCN as the
specified glyph is a job for IDEs and similar tools, and from the command
line we'll simply return the UCN as written in the source.

AlisdairM

I'm putting together an HTML document that will hopefully describe current
Clang assumptions and handling of source code and encodings, together with a
set of proposals to go forward with UCNs, Unicode string literals, raw
string literals, and source files in encodings other than UTF-8. This will
be very biased towards the C++ standard requirements, although if you can
point me to a specification for Objective-C I will take that on board.

There is no ObjC specification; for this sort of thing, though, it
doesn't have any special rules.

Issues I need to investigate right now are how/if we handle UCNs. The
impact is that a UCN will most probably take fewer characters in its string
literal representation than in the source itself, and we certainly can't
assume a 1-1 mapping of source locations to string literal representations.
Diagnostics probably will want both representations, so users get a chance
to see if their UCN character matches the glyphs they expected, while still
getting an accurate representation of the source.

We support UCNs in string/character literals, but not identifiers.
The AST representation translates the UCN, but we can ask the
preprocessor for locations in the original source; see
LiteralSupport.cpp for how we deal with this sort of thing.

Likewise, we must handle
UCNs in identifiers, with similar issues when reporting diagnostics. My
initial inclination for identifiers is that displaying the UCN as the
specified glyph is a job for IDEs and similar tools, and from the command
line we'll simply return the UCN as written in the source.

Hmm... not sure. If we want to allow extended glyphs in identifiers,
we probably have to canonicalize them in the AST; take the following
example:

int 風; // Directly written extended character
int \u98a8; // Same character written with UCN

If we want to accept this, they should both refer to the same object.
And if we don't accept the directly written form, there isn't much
point to accepting the UCN form except to say that we support the
standard...

-Eli