Adding more info to builtin source types.

Hello.

Currently, class BuiltinTypeLoc provides a single source location, accessible via member getNameLoc(), pointing to the name of the type specifier type (e.g., for 'unsigned int' it would return the location of 'int'). It is quite common for coders to declare builtin integral types using sign or width type specifiers only (e.g., 'unsigned', 'long', 'unsigned short', etc.). In these cases the getNameLoc() method would return an invalid source location.

The attached patch augments BuiltinTypeLoc class by adding two source locations, one for the sign specifier and the other for the width specifier. We have added methods to get/set the three source locations.
We also modified method getNameLoc() so as to return what is _likely_ to be the source location of the first type specifier token in the source code, assuming that the three specifiers are provided in the "canonical" order:
   <sign> <width> <type>

Please let us know if the patch is OK.

Cheers,
Enea Zaffanella.

BuiltinTypeLoc.patch (4.98 KB)

Hello Enea,

Currently, class BuiltinTypeLoc provides a single source location, accessible via member getNameLoc(), pointing to the name of the type specifier type (e.g., for 'unsigned int' it would return the location of 'int'). It is quite common for coders to declare builtin integral types using sign or width type specifiers only (e.g., 'unsigned', 'long', 'unsigned short', etc.). In these cases the getNameLoc() method would return an invalid source location.

What specific problem do you need to address? Do you just want getNameLoc() to have a valid location (which points to the sign or width if there is no actual type named), or do you want fine-grained information about where the width and sign keywords were actually located? The former could be done with a simple change where we build BuiltinTypeLocs (without increasing memory usage), while the latter requires more work (as you've done in this patch).

The attached patch augments BuiltinTypeLoc class by adding two source locations, one for the sign specifier and the other for the width specifier. We have added methods to get/set the three source locations.
We also modified method getNameLoc() so as to return what is _likely_ to be the source location of the first type specifier token in the source code, assuming that the three specifiers are provided in the "canonical" order:
<sign> <width> <type>

Please let us know if the patch is OK.

I have a few comments on this patch.

My biggest concern is that we're increasing the amount of storage needed for very common cases, and I'd like us to both try to minimize the amount of storage needed and to measure the actual cost of this change for some non-trivial header, to figure out how much extra memory its using in practice.

To minimize the amount of storage needed, we don't always want BuiltinTypeLoc to store 3 locations, because only the integral types actually need to store sign and width information, and that's only written in the source code occasionally. So, BuiltinTypeLoc should have a variable-length encoding dependent on the actual builtin type: types that always only have one source location (void, bool, float, double, nullptr, id, Class, SEL) should only store that location. For types that may have more than one location (e.g., the integral types), the BuiltinTypeLoc could encode which specifiers exist (in some prefix bytes) and then only have source locations for those keywords that actually show up in the source.

+ SourceRange getSourceRange() const {
+ SourceLocation loc = getNameLoc();
+ return SourceRange(loc, loc);
+ }

If we're going to keep the source locations for the type/width/sign, it'd be nice to the source range to cover the entire type. Storing the specifiers in order (with some tag saying which source location refers to which kind of keyword) would make this possible.

Of course, depending on how you answered the question at the top of my e-mail, all of this might be moot: if you just want BuiltinTypeLoc::getNameLoc() to point to some valid location for a integral type described only by its width or sign, that's easy to fix without any performance impact.

  - Doug

Douglas Gregor wrote:

Hello Enea,

Currently, class BuiltinTypeLoc provides a single source location, accessible via member getNameLoc(), pointing to the name of the type specifier type (e.g., for 'unsigned int' it would return the location of 'int'). It is quite common for coders to declare builtin integral types using sign or width type specifiers only (e.g., 'unsigned', 'long', 'unsigned short', etc.). In these cases the getNameLoc() method would return an invalid source location.

What specific problem do you need to address? Do you just want getNameLoc() to have a valid location (which points to the sign or width if there is no actual type named), or do you want fine-grained information about where the width and sign keywords were actually located? The former could be done with a simple change where we build BuiltinTypeLocs (without increasing memory usage), while the latter requires more work (as you've done in this patch).

In our specific case, the answer lies somehow in the middle.
Besides the need for a valid source location, we do also want to know
whether or not the type, sign and width specifiers were written
(no matter if directly or indirectly via macro expansions).
So, once the source location is valid, three additional bits would be
enough for our specific purposes.

The attached patch augments BuiltinTypeLoc class by adding two source locations, one for the sign specifier and the other for the width specifier. We have added methods to get/set the three source locations.
We also modified method getNameLoc() so as to return what is _likely_ to be the source location of the first type specifier token in the source code, assuming that the three specifiers are provided in the "canonical" order:
<sign> <width> <type>

Please let us know if the patch is OK.

I have a few comments on this patch.

My biggest concern is that we're increasing the amount of storage needed for very common cases, and I'd like us to both try to minimize the amount of storage needed and to measure the actual cost of this change for some non-trivial header, to figure out how much extra memory its using in practice.

To minimize the amount of storage needed, we don't always want BuiltinTypeLoc to store 3 locations, because only the integral types actually need to store sign and width information, and that's only written in the source code occasionally. So, BuiltinTypeLoc should have a variable-length encoding dependent on the actual builtin type: types that always only have one source location (void, bool, float, double, nullptr, id, Class, SEL) should only store that location. For types that may have more than one location (e.g., the integral types), the BuiltinTypeLoc could encode which specifiers exist (in some prefix bytes) and then only have source locations for those keywords that actually show up in the source.

Ok, we have made some measurements on gcc.c as an example.
This is a 20MB source that takes up as much as 284MB of memory
(as read from column RES of top(1)) when compiled using
clang -cc1 -fsyntax-only.

The memory size overhead due to the (current) patch is ~530000 bytes.

Based on what we observed, an alternative (minimal) solution using
3 bits for just the integer builtin types would reduce this overhead
to ~133000 bytes.
[NOTE: we are assuming a 4 bytes field to store these 3 bits, because:
   a) as far as we know, there are no free bits in a SourceLocation;
   b) we do not want to disrupt memory alignment by using a char.]

We can of course improve our current patch as you suggest (storing the
source locations, but only when needed): this would result in an
overhead of ~176000 bytes.

+ SourceRange getSourceRange() const {
+ SourceLocation loc = getNameLoc();
+ return SourceRange(loc, loc);
+ }

If we're going to keep the source locations for the type/width/sign, it'd be nice to the source range to cover the entire type. Storing the specifiers in order (with some tag saying which source location refers to which kind of keyword) would make this possible.

This would be desirable, but it is (to our eyes) a bit complicated.

We extract location info from DeclSpec: even though here we have a
source range, this would probably include other stuff such as the
storage class specifier or type qualifiers, which is not really
relevant for builtin types. Anyway, if you think that this is OK,
adding the source range to the BuiltinTypeLoc will be easy:
it could be combined with any of the three solutions mentioned
above (NOTE: having the source range, we could even drop all
the source locations used above, but we would still need at
least the three additional bits).

As for the ordering of the specifiers: there seems to be no info
about it in class DeclSpec.

Of course, depending on how you answered the question at the top of my e-mail, all of this might be moot: if you just want BuiltinTypeLoc::getNameLoc() to point to some valid location for a integral type described only by its width or sign, that's easy to fix without any performance impact.

    - Doug

All considered, it is mainly a matter of deciding what to do.
To our eyes, the overhead seems not to be a real issue.

Cheers,
Enea Zaffanella.

Douglas Gregor wrote:

Hello Enea,

Currently, class BuiltinTypeLoc provides a single source location, accessible via member getNameLoc(), pointing to the name of the type specifier type (e.g., for 'unsigned int' it would return the location of 'int'). It is quite common for coders to declare builtin integral types using sign or width type specifiers only (e.g., 'unsigned', 'long', 'unsigned short', etc.). In these cases the getNameLoc() method would return an invalid source location.

What specific problem do you need to address? Do you just want getNameLoc() to have a valid location (which points to the sign or width if there is no actual type named), or do you want fine-grained information about where the width and sign keywords were actually located? The former could be done with a simple change where we build BuiltinTypeLocs (without increasing memory usage), while the latter requires more work (as you've done in this patch).

In our specific case, the answer lies somehow in the middle.
Besides the need for a valid source location, we do also want to know
whether or not the type, sign and width specifiers were written
(no matter if directly or indirectly via macro expansions).
So, once the source location is valid, three additional bits would be
enough for our specific purposes.

The attached patch augments BuiltinTypeLoc class by adding two source locations, one for the sign specifier and the other for the width specifier. We have added methods to get/set the three source locations.
We also modified method getNameLoc() so as to return what is _likely_ to be the source location of the first type specifier token in the source code, assuming that the three specifiers are provided in the "canonical" order:
<sign> <width> <type>

Please let us know if the patch is OK.

I have a few comments on this patch.
My biggest concern is that we're increasing the amount of storage needed for very common cases, and I'd like us to both try to minimize the amount of storage needed and to measure the actual cost of this change for some non-trivial header, to figure out how much extra memory its using in practice.
To minimize the amount of storage needed, we don't always want BuiltinTypeLoc to store 3 locations, because only the integral types actually need to store sign and width information, and that's only written in the source code occasionally. So, BuiltinTypeLoc should have a variable-length encoding dependent on the actual builtin type: types that always only have one source location (void, bool, float, double, nullptr, id, Class, SEL) should only store that location. For types that may have more than one location (e.g., the integral types), the BuiltinTypeLoc could encode which specifiers exist (in some prefix bytes) and then only have source locations for those keywords that actually show up in the source.

Ok, we have made some measurements on gcc.c as an example.
This is a 20MB source that takes up as much as 284MB of memory
(as read from column RES of top(1)) when compiled using
clang -cc1 -fsyntax-only.

284MB sounds really high for 20MB of source :frowning:

The memory size overhead due to the (current) patch is ~530000 bytes.

Based on what we observed, an alternative (minimal) solution using
3 bits for just the integer builtin types would reduce this overhead
to ~133000 bytes.
[NOTE: we are assuming a 4 bytes field to store these 3 bits, because:
a) as far as we know, there are no free bits in a SourceLocation;
b) we do not want to disrupt memory alignment by using a char.]

Right, that makes sense.

We can of course improve our current patch as you suggest (storing the
source locations, but only when needed): this would result in an
overhead of ~176000 bytes.

I actually like the idea of using 3 bits to say which of type/signedness/width were specified in the type, then just making sure that the SourceLocation points to something that makes sense... the type specifier location if it's there, otherwise the signedness or width location. That gives us a lot more information in the AST without costing much at all.

+ SourceRange getSourceRange() const {
+ SourceLocation loc = getNameLoc();
+ return SourceRange(loc, loc);
+ }
If we're going to keep the source locations for the type/width/sign, it'd be nice to the source range to cover the entire type. Storing the specifiers in order (with some tag saying which source location refers to which kind of keyword) would make this possible.

This would be desirable, but it is (to our eyes) a bit complicated.

We extract location info from DeclSpec: even though here we have a
source range, this would probably include other stuff such as the
storage class specifier or type qualifiers, which is not really
relevant for builtin types. Anyway, if you think that this is OK,
adding the source range to the BuiltinTypeLoc will be easy:
it could be combined with any of the three solutions mentioned
above (NOTE: having the source range, we could even drop all
the source locations used above, but we would still need at
least the three additional bits).

As for the ordering of the specifiers: there seems to be no info
about it in class DeclSpec.

Ah, okay. Let's not worry about the source range; a single *useful* location is good enough.

Of course, depending on how you answered the question at the top of my e-mail, all of this might be moot: if you just want BuiltinTypeLoc::getNameLoc() to point to some valid location for a integral type described only by its width or sign, that's easy to fix without any performance impact.
   - Doug

All considered, it is mainly a matter of deciding what to do.
To our eyes, the overhead seems not to be a real issue.

I'm a bit more paranoid about memory overhead and PCH size.

  - Doug

Douglas Gregor wrote:

[...]

I actually like the idea of using 3 bits to say which of type/signedness/width were specified in the type, then just making sure that the SourceLocation points to something that makes sense... the type specifier location if it's there, otherwise the signedness or width location. That gives us a lot more information in the AST without costing much at all.

[...]

Ah, okay. Let's not worry about the source range; a single *useful* location is good enough.

Here is a revised patch.

We store a single location for all builtin types where you cannot write a width/sign specifier (void, char16, etc.).

For the others builtins, we have additional 4 bytes that are used to store a total of 10 bits (instead of the 3 mentioned above): we found that out initial analysis was incomplete in that we were not considering, e.g., the possibility of having a 'mode' attribute specifying the width of the builtin. Hence now we store:
  - the TST (5 bits), TSW (2 bits) and TSS (2 bits) fields
    **before** these are modified by the semantic analysis done
    in DeclSpec::Finish();
  - in the 10th bit, whether or not the ModeAttr was given
    **before** this is taken away by TakeAttributes().

Cheers,
Enea Zaffanella.

BuiltinTypeLoc.patch (11.5 KB)

Hello Enea,

Douglas Gregor wrote:

Hello Enea,

[...]

Okay, great. I committed your patch here:

    http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20100118/026437.html

with a couple of tweaks:

[...]

+ bool needsExtraLocalData() const {
+ BuiltinType::Kind bk = getTypePtr()->getKind();
+ return (bk >= BuiltinType::UShort && bk <= BuiltinType::UInt128)
+ || (bk >= BuiltinType::Short && bk <= BuiltinType::LongDouble)
+ || bk == BuiltinType::UChar
+ || bk == BuiltinType::SChar;
+ }

"float", "double", and "long double" don't need any extra data, since we'll always have a location for the type specifier (float or double) and the type is never implied by either a sign or a width.

Actually, the use of 'mode' attributes can change the width of floating point types, hence the need to store, even in this case, the written type specifier, the written width specifier and a Boolean flag for the presence of any 'mode' attribute.

As an example, 'float' is translated by 'XF' to 'long double':

$ cat a.c
float f __attribute__((mode(XF)));

$ clang -cc1 -ast-dump a.c
typedef __int128_t __int128_t;
typedef __uint128_t __uint128_t;
struct __va_list_tag {
     unsigned int gp_offset;
     unsigned int fp_offset;
     void *overflow_arg_area;
     void *reg_save_area;
};
typedef struct __va_list_tag __va_list_tag;
typedef __va_list_tag __builtin_va_list[1];
long double f;

You are right that, in this case, a sign specifier is not really needed ... but this anyway incurs no additional space overhead.

All the other changes to my patch are plainly right.

Cheers,
Enea Zaffanella.

Oh, I understand now, thanks. Fixed here:

  http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20100118/026449.html

  - Doug