A small proposal about TagType's and source ranges

We are struggling to fix lack of proper source location range for
TagType's due to missing setting of both source location for tag keyword
and name.

We'd like to propose the following simplification of current approach to
tag types and elaborated types storing:

Currently clang uses (only in C++) an ElaboratedType whenever the type
has an explicit class/struct/union/enum keyword *or* a name qualifier.

We propose to use an ElaboratedType *only* whenever there is a name
qualifier and to add to TagTypeLoc a (possibly invalid) tag keyword
location (removing the one now available in ElaboratedTypeLoc).

This permit to avoid a lot of elaborated types in C++, to uniform C and
C++ and to permit to have proper source ranges on both C and C++ (in C
this is currently impossible).

Personally I don't see any drawback and the needed changes are trivial,
however I'd like to hear core developers opinion before to proceed.

If you do that, you should rename ElaboratedType, because the term comes from the C++ standard and means "a tag type named with its tag keyword". (dcl.type.elab).

Sebastian

Yes, I've thought to call it QualifiedNameType (as it was called 2 years
ago before to be mixed with ElaboratedType).

How would we represent "struct A::B" in C++?

It's also important to distinguish between A and struct A in C++. If we don't have an elaborated type node, what remembers this distinction?

John.

On the syntactic side they can be distinguished in the TagTypeLoc using
the validity of tag keyword location. On the semantic side they can be
distinguished inheriting TagType from TypeWithKeyword.

Adding non-canonical state to TagType is not really acceptable; it's really,
really nice right now that TagTypes are always canonical.

John.

If you prefer in this way the inheritance from TypeWithKeyword can be
avoided. Do we need to distinguish between "S" and "struct S" in places
where TypeLoc is not available?

Generally not, but I'd rather not rely on it. Wouldn't the better solution
to have some mode which enables ElaboratedTypes even in C?

John.

Of course this is a possibility, but IMHO I don't think it is a good
idea to increase the memory footprint for a need that is not (or not
yet) materialized.

Let me do a little summary of presented alternatives:

C language:

Alternative 1: struct S is an ElaboratedType referring a RecordType, the
canonical type of both is that RecordType

Alternative 2a and 2b: struct S is a RecordType whose canonical type is
itself

C++ language:

Alternative 1: struct S is an ElaboratedType referring a Recordtype. the
canonical type of both is that RecordType

Alternative 2a: struct S is a RecordType whose canonical type is another
RecordType with tag kind bitfield set to None

Alternative 2b: struct S is a Recordtype whose canonical type is itself
and the only way to know if a tag keyword was present is to consult
RecordTypeLoc

I've to say that I don't see any benefits of 1 wrt 2.

About the differences between 2a and 2b I think they are not so
relevant: for C in 100% of cases the canonical type of a RecordType is
itself and for C++ this is true in 95% of cases.

Wouldn't the better solution to have some mode which enables
ElaboratedTypes even in C?

Of course this is a possibility, but IMHO I don't think it is a good
idea to increase the memory footprint for a need that is not (or not
yet) materialized.

Let's be clear here: the memory footprint is going to increase with this,
just as it's increased with all the other source range changes you've made
recently. I expect that the increase in memory footprint from allocating
a uniqued ElaboratedType per RecordType in C is going to be quite small
when compared with storing all the extra source locations. One of the nice
things about a mode to enable ElaboratedTypes is that we have the
option of disabling *both* of those costs.

Let me do a little summary of presented alternatives:

C language:

Alternative 1: struct S is an ElaboratedType referring a RecordType, the
canonical type of both is that RecordType

Alternative 2a and 2b: struct S is a RecordType whose canonical type is
itself

C++ language:

Alternative 1: struct S is an ElaboratedType referring a Recordtype. the
canonical type of both is that RecordType

Alternative 2a: struct S is a RecordType whose canonical type is another
RecordType with tag kind bitfield set to None

Alternative 2b: struct S is a Recordtype whose canonical type is itself
and the only way to know if a tag keyword was present is to consult
RecordTypeLoc

I've to say that I don't see any benefits of 1 wrt 2.

You don't see changing the assumption, made throughout the compiler,
that a TagType is always canonical as having any cost?

About the differences between 2a and 2b I think they are not so
relevant: for C in 100% of cases the canonical type of a RecordType is
itself and for C++ this is true in 95% of cases.

Something being usually true does not actually free us of the obligation
of checking and compensating for the possibility that it is not.

John.

Wouldn't the better solution to have some mode which enables
ElaboratedTypes even in C?

Of course this is a possibility, but IMHO I don't think it is a good
idea to increase the memory footprint for a need that is not (or not
yet) materialized.

Let's be clear here: the memory footprint is going to increase with this,
just as it's increased with all the other source range changes you've made
recently. I expect that the increase in memory footprint from allocating
a uniqued ElaboratedType per RecordType in C is going to be quite small
when compared with storing all the extra source locations. One of the nice
things about a mode to enable ElaboratedTypes is that we have the
option of disabling *both* of those costs.

To have precise source ranges is a design goal for clang (and a must for
several applications, ours included), so we try (as always done) to
achieve this trying to keep needed memory as small as possibile.

C language:

Alternative 1: struct S is an ElaboratedType referring a RecordType, the
canonical type of both is that RecordType

Alternative 2a and 2b: struct S is a RecordType whose canonical type is
itself

C++ language:

Alternative 1: struct S is an ElaboratedType referring a Recordtype. the
canonical type of both is that RecordType

Alternative 2a: struct S is a RecordType whose canonical type is another
RecordType with tag kind bitfield set to None

Alternative 2b: struct S is a Recordtype whose canonical type is itself
and the only way to know if a tag keyword was present is to consult
RecordTypeLoc

I've to say that I don't see any benefits of 1 wrt 2.

You don't see changing the assumption, made throughout the compiler,
that a TagType is always canonical as having any cost?

I was under the impression that to get the canonical type of a type had
a negligible cost. Consider however that alternative 2b does not change
this assumption.

About the differences between 2a and 2b I think they are not so
relevant: for C in 100% of cases the canonical type of a RecordType is
itself and for C++ this is true in 95% of cases.

Something being usually true does not actually free us of the obligation
of checking and compensating for the possibility that it is not.

Once read your preferences I can make another proposal:

Alternative 3:

C language: struct S is a RecordType (always canonical)

C++ language: struct S is an ElaboratedType (always non canonical) with
an inner RecordType (always canonical)

This would be exactly like it is now, but KeywordLoc would be moved
from ElaboratedTypeLoc to RecordTypeLoc so to be available for both C
and C++.

This solution does not have minimal memory needs like 2b (that I confess
is still my pet :wink: but has the lowest impact wrt current
implementation, does it matches your expectations?

Wouldn't the better solution to have some mode which enables
ElaboratedTypes even in C?

Of course this is a possibility, but IMHO I don't think it is a good
idea to increase the memory footprint for a need that is not (or not
yet) materialized.

Let's be clear here: the memory footprint is going to increase with this,
just as it's increased with all the other source range changes you've made
recently. I expect that the increase in memory footprint from allocating
a uniqued ElaboratedType per RecordType in C is going to be quite small
when compared with storing all the extra source locations. One of the nice
things about a mode to enable ElaboratedTypes is that we have the
option of disabling *both* of those costs.

To have precise source ranges is a design goal for clang (and a must for
several applications, ours included), so we try (as always done) to
achieve this trying to keep needed memory as small as possibile.

Having precise source locations is certainly a desirable goal, but it does
need to be balanced against other goals like keeping memory usage
reasonable. For example, we could theoretically store the locations of
semicolons and commas, but we don't, because we assume it's possible to
discover them by re-lexing from a known location.

I'm not saying that elaborated types in C are necessarily in this category,
particularly since (unlike commas) they *precede* the known location,
but perfect source location fidelity is, in the end, one goal among many.
Case in point, I understood your reasoning for adding InnerLocStart to
DeclaratorDecl, but that was a noticeable inflation in AST size for the
benefit of a fairly small number of consumers.

You don't see changing the assumption, made throughout the compiler,
that a TagType is always canonical as having any cost?

I was under the impression that to get the canonical type of a type had
a negligible cost. Consider however that alternative 2b does not change
this assumption.

There are several core algorithms (like getAs) which change depending on
whether a type can be sugar, although maybe this proposal's changes
would be fairly mild on them. Having to worry about multiple RecordTypes
per canonical decl would introduce some non-trivial costs, though.

Once read your preferences I can make another proposal:

Alternative 3:

C language: struct S is a RecordType (always canonical)

C++ language: struct S is an ElaboratedType (always non canonical) with
an inner RecordType (always canonical)

This would be exactly like it is now, but KeywordLoc would be moved
from ElaboratedTypeLoc to RecordTypeLoc so to be available for both C
and C++.

This would be quite expensive for C++, which uses unelaborated class
types a lot — in addition to the obvious uses, recall that we store a
TypeLoc for every constructor or destructor name.

John.

Wouldn't the better solution to have some mode which enables
ElaboratedTypes even in C?

Of course this is a possibility, but IMHO I don't think it is a good
idea to increase the memory footprint for a need that is not (or not
yet) materialized.

Let's be clear here: the memory footprint is going to increase with this,
just as it's increased with all the other source range changes you've made
recently. I expect that the increase in memory footprint from allocating
a uniqued ElaboratedType per RecordType in C is going to be quite small
when compared with storing all the extra source locations. One of the nice
things about a mode to enable ElaboratedTypes is that we have the
option of disabling *both* of those costs.

To have precise source ranges is a design goal for clang (and a must for
several applications, ours included), so we try (as always done) to
achieve this trying to keep needed memory as small as possibile.

Having precise source locations is certainly a desirable goal, but it does
need to be balanced against other goals like keeping memory usage
reasonable. For example, we could theoretically store the locations of
semicolons and commas, but we don't, because we assume it's possible to
discover them by re-lexing from a known location.

... this is a big surprise for me. In the general case how you'd cope
with macro expansion?

Once read your preferences I can make another proposal:

Alternative 3:

C language: struct S is a RecordType (always canonical)

C++ language: struct S is an ElaboratedType (always non canonical) with
an inner RecordType (always canonical)

This would be exactly like it is now, but KeywordLoc would be moved
from ElaboratedTypeLoc to RecordTypeLoc so to be available for both C
and C++.

This would be quite expensive for C++, which uses unelaborated class
types a lot — in addition to the obvious uses, recall that we store a
TypeLoc for every constructor or destructor name.

Ok, then.

We will proceed according your guidelines. I resume it here to be sure
to have not equivocated:

- use ElaboratedType also for C
- obtain source range of ElaboratedType using the keyword location
stored in ElaboratedTypeLoc and the name location available in RecordTypeLoc

Badly, I expect. I believe the thinking is that we would rather have an API
which relexes the entire translation unit up to a point than to waste what's
quite likely to be dozens of megabytes per translation unit storing
unnecessary commas, equals signs, and semicolons.

We don't currently have such an API only because nobody has written it
yet, which is probably related to the fact that it's hard to imagine why anyone
would both (1) have an interest in that level of source information *and*
(2) be able to proceed with their intended operation in the face of macros.
Certainly most refactoring engines have no business operating within
macros.

John.

Committed in r127755.