Source information for types

Hi clang community,

I'd like to make a proposal for keeping source location information for types in the AST, to allow functionality like "find the source locations where this typedef is used".
Type source info will be provided using a flag so the memory overhead involved will be only for clients that need that information.
Using a flag will also make it easy to find out what exactly the overhead is when type source info is enabled.

Ideally, since this is a feature useful only for specific clients, we don't want to impose any overhead on the non-type-source-info ASTs.
Here's the plan:

SourceType:
There will be a new, non-canonical, non-uniqued type 'SourceType', whose purpose will be to store source location information for a type-specifier.
It will also contain the 'base' QualType of the type-specifier.
The exact amount of source info that it will keep is not important at this point; it can be discussed/debated after SourceType is introduced.
The owner of SourceType objects will be ASTContext.

TypeSpecifier:
TypeSpecifier will be a very thin wrapper over a QualType. It will have a getType() method that checks whether QualType is a SourceType,
and if it is, it will return the 'base' QualType of SourceType, otherwise it will return the original QualType that it wraps.
It will also have isSourceType() and getSourceType() methods.

TypeSpecifier will replace the QualType fields in the Decls. For example, ValueDecl will contain a TypeSpecifier instead of a plain QualType.
"QualType ValueDecl::getType()" will delegate to TypeSpecifier::getType() (ValueDecl::getType() will still return a QualType).
TypeSpecifier will also be contained in Exprs that deal with type-specifiers, like SizeOfAlignOfExpr.

Since getType() of Decls/Exprs will never return a QualType that is actually a SourceType, SourceType won't enter the semantic type checking code paths.
You would have to specifically ask for the TypeSpecifier through ValueDecl::getTypeSpec() to get at the SourceType.

The end result is that for the normal non-type-source-info case there will be no memory overhead at all.
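
A minimal, self-contained sketch of the plan above, using toy stand-ins rather than the real clang classes. The names `Context`, `declare`, `keepTypeSourceInfo` and the integer `loc` field are assumptions made up for illustration; only `SourceType`, `TypeSpecifier`, `getType()`, `getTypeSpec()`, `isSourceType()` and `getSourceType()` come from the proposal itself.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy stand-ins only; these are not the real clang classes.
struct Type {
  enum Kind { Builtin, Source };
  Kind kind;
  explicit Type(Kind k) : kind(k) {}
  virtual ~Type() = default;
};

struct BuiltinType : Type {
  BuiltinType() : Type(Builtin) {}
};

// Non-canonical, non-uniqued node: the 'base' type plus where it was written.
struct SourceType : Type {
  const Type *base;
  unsigned loc;  // hypothetical stand-in for a source location
  SourceType(const Type *b, unsigned l) : Type(Source), base(b), loc(l) {}
};

// Thin wrapper that hides SourceType from semantic clients.
class TypeSpecifier {
  const Type *ty;

public:
  explicit TypeSpecifier(const Type *t) : ty(t) {}
  bool isSourceType() const { return ty->kind == Type::Source; }
  const SourceType *getSourceType() const {
    return isSourceType() ? static_cast<const SourceType *>(ty) : nullptr;
  }
  // Never returns a SourceType: looks through it to the base type.
  const Type *getType() const {
    return isSourceType() ? getSourceType()->base : ty;
  }
};

struct ValueDecl {
  TypeSpecifier spec;
  const Type *getType() const { return spec.getType(); }
  const TypeSpecifier &getTypeSpec() const { return spec; }
};

// Owns every type node (the role the proposal gives to ASTContext) and
// carries the flag that decides whether SourceTypes are created at all.
struct Context {
  bool keepTypeSourceInfo;
  std::vector<std::unique_ptr<Type>> owned;

  explicit Context(bool keep) : keepTypeSourceInfo(keep) {}

  const Type *makeBuiltin() {
    owned.push_back(std::make_unique<BuiltinType>());
    return owned.back().get();
  }

  ValueDecl declare(const Type *t, unsigned loc) {
    if (!keepTypeSourceInfo)
      return ValueDecl{TypeSpecifier(t)};  // source info is discarded
    owned.push_back(std::make_unique<SourceType>(t, loc));
    return ValueDecl{TypeSpecifier(owned.back().get())};
  }
};

// Usage: the semantic type is the same with the flag on or off.
void example() {
  Context plain(false), rich(true);
  const Type *i1 = plain.makeBuiltin();
  const Type *i2 = rich.makeBuiltin();
  ValueDecl d1 = plain.declare(i1, 7);
  ValueDecl d2 = rich.declare(i2, 7);
  assert(d1.getType() == i1 && !d1.getTypeSpec().isSourceType());
  assert(d2.getType() == i2 && d2.getTypeSpec().isSourceType());
  assert(d2.getTypeSpec().getSourceType()->loc == 7);
}
```

In this toy model, code that only calls `getType()` cannot tell whether the flag was on; only a client that asks for `getTypeSpec()` ever sees the `SourceType`.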

Do you find the above reasonable ?
Any questions/comments/suggestions will be greatly appreciated.

-Argiris

I'm pretty sure you can do this already: you can recurse into a type
to find typedefs, getTypeSpecStartLoc returns the start of the
declaration specifiers, and you can re-lex the source and figure out
the precise location of the typedef name from there. It might be nice
to provide utility methods to do this, but I don't see any need for
any core changes.

Also, your suggestion doesn't really address where exactly you plan to
store your flag; there aren't any spare bits in QualType.

-Eli

I'd like to make a proposal for keeping source location information
for types in the AST, to allow functionality like "find the source
locations where this typedef is used".

I'm pretty sure you can do this already: you can recurse into a type
to find typedefs, getTypeSpecStartLoc returns the start of the
declaration specifiers, and you can re-lex the source and figure out
the precise location of the typedef name from there. It might be nice
to provide utility methods to do this, but I don't see any need for
any core changes.

Might be simple for a single typedef name but it doesn't work well with C++ types like this:

std::vector< my_template< my_namespace::my_type, my_variable > >

Ideally all the type source info for the above type would be readily available from the AST. Recomputing all of it from the tokens isn't practical; Sema already did the hard work, we just need to store it.

Also, your suggestion doesn't really address where exactly you plan to
store your flag; there aren't any spare bits in QualType.

The flag is just a boolean to pass to Sema to tell it whether to create SourceTypes or not. If the flag is false, Sema will discard the source info and just pass QualTypes to the Decls, as it currently does.

-Argiris

Ideally all the type source info for the above type would be readily
available from the AST; recomputing all the information from the tokens
isn't practical, Sema already did the hard work, we just need to store it.

Hmm... are you planning to make the SourceType recursive?

Also, your suggestion doesn't really address where exactly you plan to
store your flag; there aren't any spare bits in QualType.

The flag is just a boolean to pass to Sema to tell it whether to create
SourceTypes or not. If the flag is false, Sema will discard the source info
and just pass QualTypes to the Decls, as it currently does.

Oh, I see; it would vary with a flag on the ASTContext rather than the
individual Decls.

-Eli

Ideally all the type source info for the above type would be readily
available from the AST; recomputing all the information from the tokens
isn't practical, Sema already did the hard work, we just need to store it.

Hmm... are you planning to make the SourceType recursive?

Possibly but no concrete plans about C++ yet. Let's start focusing on the single typedef case for now :)

Also, your suggestion doesn't really address where exactly you plan to
store your flag; there aren't any spare bits in QualType.

The flag is just a boolean to pass to Sema to tell it whether to create
SourceTypes or not. If the flag is false, Sema will discard the source info
and just pass QualTypes to the Decls, as it currently does.

Oh, I see; it would vary with a flag on the ASTContext rather than the
individual Decls.

Yes exactly; but note that whether a TypeSpecifier wraps a ‘plain’ QualType or a QualType containing a SourceType can be varied and checked using TypeSpecifier::isSourceType().

This is basically what the TypeSpecifier would look like:

class TypeSpecifier {
  QualType Ty;

public:
  explicit TypeSpecifier(QualType type) : Ty(type) { }
  TypeSpecifier(SourceType *sourceTy) : Ty(QualType(sourceTy, 0)) { }

  // Returns the semantic type, looking through a SourceType if present.
  QualType getType() const {
    if (SourceType *ST = dyn_cast_or_null<SourceType>(Ty.getTypePtr()))
      return ST->getBaseType();
    return Ty;
  }

  bool isSourceType() const {
    return dyn_cast_or_null<SourceType>(Ty.getTypePtr()) != 0;
  }
  SourceType *getSourceType() const { return cast<SourceType>(Ty); }
};

-Argiris

Ah, so SourceType inherits from Type? That feels slightly nasty, but
it works, I guess.

-Eli

Yes, it will be a Type subclass that the type system should not deal with and should not be aware of (e.g. CodeGen will never come across a SourceType).
The benefit is that the size of Decls remains the same (no need to add a pointer to a "type source info" object).
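
The size argument can be checked with a small sketch. Everything here is a toy stand-in: `QualTypeLike`, `SideInfo` and the three `Decl...` structs are hypothetical names, and the equalities hold on mainstream ABIs rather than being guaranteed by the language.

```cpp
#include <cstddef>

struct Type {};      // toy stand-in for a type node
struct SideInfo {};  // hypothetical separate "type source info" object

// Toy QualType: a single pointer-sized word.
class QualTypeLike {
  const Type *ptr = nullptr;
};

// The wrapper adds behaviour but no data members.
class TypeSpecifier {
  QualTypeLike ty;
};

struct DeclWithQualType { QualTypeLike ty; };
struct DeclWithSpecifier { TypeSpecifier spec; };
struct DeclWithSidePointer {
  QualTypeLike ty;
  const SideInfo *info = nullptr;  // the extra pointer the proposal avoids
};

constexpr std::size_t kPlain = sizeof(DeclWithQualType);
constexpr std::size_t kWrapped = sizeof(DeclWithSpecifier);
constexpr std::size_t kSidePointer = sizeof(DeclWithSidePointer);

static_assert(sizeof(TypeSpecifier) == sizeof(QualTypeLike),
              "wrapping the type costs nothing");
static_assert(kWrapped == kPlain, "the Decl stays the same size");
static_assert(kSidePointer > kPlain, "a side pointer grows every Decl");
```

The trade-off is the one Eli points out: the location data has to live in a `Type` subclass that the rest of the type system must never see.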

-Argiris

Hi Argiris,

I always thought we'd do this sort of thing by introducing several new subclasses of type, but ones that are very specific to the various types. For example, in an ArrayType, you really want to store the location of the [ and ], and the "Expr*" of the size. To do this, we'd have something like:

class ConstantArrayTypeWithLoc : public ConstantArrayType {
   SourceLocation LBracketLoc, RBracketLoc;
   Expr *Size;
...
};

The canonical form of the array type would be the normal ConstantArray without location info (so that canonical types are pointer unique as usual). This just provides loc info for ConstantArray, so we'd need to have a per-class subclass for every type that has a token. Does this approach make sense?
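
A rough sketch of how this could fit together, again with toy classes rather than real clang code. `Context`, `getArray`, `getArrayWithLoc` and the integer locations are assumptions for illustration, and the `Expr *Size` member is left out to keep the example short. The point is only that the location-carrying node is not uniqued while its canonical type is.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-ins only; not the real clang classes.
struct Type {
  const Type *canonical = nullptr;  // set by the Context
  virtual ~Type() = default;
};

struct IntType : Type {};

struct ConstantArrayType : Type {
  const Type *element;
  unsigned size;
  ConstantArrayType(const Type *e, unsigned n) : element(e), size(n) {}
};

struct ConstantArrayTypeWithLoc : ConstantArrayType {
  unsigned lBracketLoc, rBracketLoc;  // stand-ins for SourceLocation
  ConstantArrayTypeWithLoc(const Type *e, unsigned n, unsigned l, unsigned r)
      : ConstantArrayType(e, n), lBracketLoc(l), rBracketLoc(r) {}
};

class Context {
  std::vector<std::unique_ptr<Type>> owned;
  std::vector<const ConstantArrayType *> canonicalArrays;
  IntType intTy;

public:
  Context() { intTy.canonical = &intTy; }
  const Type *getInt() const { return &intTy; }

  // Canonical arrays are uniqued: same element and size, same pointer.
  const ConstantArrayType *getArray(const Type *elem, unsigned n) {
    for (const ConstantArrayType *a : canonicalArrays)
      if (a->element == elem->canonical && a->size == n)
        return a;
    auto node = std::make_unique<ConstantArrayType>(elem->canonical, n);
    node->canonical = node.get();
    const ConstantArrayType *raw = node.get();
    owned.push_back(std::move(node));
    canonicalArrays.push_back(raw);
    return raw;
  }

  // Location-carrying arrays are created fresh for every occurrence.
  const ConstantArrayTypeWithLoc *getArrayWithLoc(const Type *elem, unsigned n,
                                                  unsigned l, unsigned r) {
    auto node = std::make_unique<ConstantArrayTypeWithLoc>(elem, n, l, r);
    node->canonical = getArray(elem, n);
    const ConstantArrayTypeWithLoc *raw = node.get();
    owned.push_back(std::move(node));
    return raw;
  }
};

// Usage: two spellings of int[4] at different places share one canonical type.
void example() {
  Context ctx;
  const auto *a = ctx.getArrayWithLoc(ctx.getInt(), 4, 10, 12);
  const auto *b = ctx.getArrayWithLoc(ctx.getInt(), 4, 30, 32);
  assert(a != b);
  assert(a->canonical == b->canonical);
}
```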

-Chris

Chris Lattner wrote:

I always thought we'd do this sort of thing by introducing several new
subclasses of type, but ones that are very specific to the various
types.
The canonical form of the array type would be the normal ConstantArray
without location info (so that canonical types are pointer unique as
usual). This just provides loc info for ConstantArray, so we'd need
to have a per-class subclass for every type that has a token. Does
this approach make sense?
  

That was my idea for it as well, until I discarded it because I confused
the Type and Decl hierarchies. (The decl hierarchy is deeper, leading to
more redundant work.)

Sebastian

How are decls involved?

-Chris

They aren't. I was just confused. I think your way would work great.

Sebastian

I think it would work, but Doug came up with a better approach. If he's nice, maybe he'll send out a summary of his approach tomorrow :)

-Chris

Why not ? Right now CodeGen uses
  Ty->getDecl()->getLocation();
to find the type def location for debug info entries.

If you have:

typedef int foo; // #1
foo x; // #2
foo y; // #3

Isn't the interesting location for debug info the one on #1 ? (the location where 'foo' is declared)

Is there a reason for the debugger to know where exactly 'foo' was written inside the 'x' and 'y' declarations ? (in #2 and #3)

-Argiris

No, the debugger just wants to know where the decls are.

-Chris

Hello.

Argyrios Kyrtzidis wrote:

Hi clang community,

I'd like to make a proposal for keeping source location information for types in the AST, to allow functionality like "find the source locations where this typedef is used".

[...snip...]

TypeSpecifier will replace the QualType fields in the Decls. For example, ValueDecl will contain a TypeSpecifier instead of a plain QualType.
"QualType ValueDecl::getType()" will delegate to TypeSpecifier::getType() (ValueDecl::getType() will still return a QualType).
TypeSpecifier will also be contained in Exprs that deal with type-specifiers, like SizeOfAlignOfExpr.

[...]

As clang users, we would be really interested in such a proposal (or a functionally equivalent one), and hence we would like to know whether there has been any progress or whether there are plans in this respect.

Right now, we are facing problems when trying,
e.g., to distinguish the following two cases:

void foo() {
     /* ... */
     int a[sizeof(struct S*)];
     int b[sizeof(struct S {int a;}*)];
     /* ... */
}

from

void bar() {
     /* ... */
     int a[sizeof(struct S {int a;}*)];
     int b[sizeof(struct S*)];
     /* ... */
}

That is, by looking at the AST only it is quite difficult (impossible?) to understand whether the `struct S' was defined inside the declaration of `a' or that of `b'. Unless I am missing something, both pointer types have no associated location info and refer to the very same RecordDecl node. Even though the example above doesn't look really interesting, variations can be built that are relevant, e.g., for applications that need to check strict adherence to user-defined coding standards.

Another issue that could be solved by following this proposal is the ability to distinguish different syntactic representations of the same type; namely, distinguishing "unsigned" from "unsigned int" from "int unsigned" and the like.
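
To make the point concrete: the three spellings name one and the same type, so nothing in the type itself records which one was written. A small compile-time check (the constant names are made up for this example):

```cpp
#include <type_traits>

// All three spellings denote the same type, so the semantic type alone
// cannot tell a checker which one appeared in the source.
constexpr bool kUnsignedIsUnsignedInt =
    std::is_same<unsigned, unsigned int>::value;
constexpr bool kOrderDoesNotMatter =
    std::is_same<unsigned int, int unsigned>::value;

static_assert(kUnsignedIsUnsignedInt, "'unsigned' is 'unsigned int'");
static_assert(kOrderDoesNotMatter, "'int unsigned' is 'unsigned int'");
```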

As yet another example, we would like to be able to distinguish between
   void (*pf)(int left_is_array[], int* right_is_pointer);
vs
   void (*pf)(int* left_is_pointer, int right_is_array[]);

[Note: since pf is a pointer to function, there are no parameters, hence we cannot keep track of the parameter names, nor can we retrieve the original type from OriginalParmVarDecl].
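
A compile-time illustration of why the two declarations cannot be told apart from the type alone. This assumes the parameters named `..._is_array` are meant to be written with array syntax (`int x[]`); the alias and constant names are made up for the example.

```cpp
#include <type_traits>

// Assumption: the "_is_array" parameters use array syntax.
using LeftArray = void (*)(int left_is_array[], int *right_is_pointer);
using RightArray = void (*)(int *left_is_pointer, int right_is_array[]);

// Array parameters are adjusted to pointers when the function type is
// formed, and parameter names are not part of the type, so both aliases
// end up as void (*)(int *, int *).
constexpr bool kSameFunctionPointerType =
    std::is_same<LeftArray, RightArray>::value;
constexpr bool kBothArePointerPointer =
    std::is_same<LeftArray, void (*)(int *, int *)>::value;

static_assert(kSameFunctionPointerType, "array syntax is lost in the type");
static_assert(kBothArePointerPointer, "both parameters are int *");
```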

In our opinion, re-lexing the type specifier is not an option.

Cheers,
Enea Zaffanella.

Hi Enea,

Hello.

Argyrios Kyrtzidis wrote:

Hi clang community,
I'd like to make a proposal for keeping source location information for types in the AST, to allow functionality like "find the source locations where this typedef is used".

[...snip...]

TypeSpecifier will replace the QualType fields in the Decls. For example, ValueDecl will contain a TypeSpecifier instead of a plain QualType.
"QualType ValueDecl::getType()" will delegate to TypeSpecifier::getType() (ValueDecl::getType() will still return a QualType).
TypeSpecifier will also be contained in Exprs that deal with type-specifiers, like SizeOfAlignOfExpr.

[...]

As clang users, we would be really interested in such a proposal (or a functionally equivalent one), and hence we would like to know whether there has been any progress or whether there are plans in this respect.

Yes, I quite recently started working on this feature.

Right now, we are facing problems when trying,
e.g., to distinguish the following two cases:

void foo() {
   /* ... */
   int a[sizeof(struct S*)];
   int b[sizeof(struct S {int a;}*)];
   /* ... */
}

from

void bar() {
   /* ... */
   int a[sizeof(struct S {int a;}*)];
   int b[sizeof(struct S*)];
   /* ... */
}

That is, by looking at the AST only it is quite difficult (impossible?) to understand whether the `struct S' was defined inside the declaration of `a' or that of `b'. Unless I am missing something, both pointer types have no associated location info and refer to the very same RecordDecl node. Even though the example above doesn't look really interesting, variations can be built that are relevant, e.g., for applications that need to check strict adherence to user-defined coding standards.

This is exactly the kind of information we want to encode in type source info.

Another issue that could be solved by following this proposal is the ability to distinguish different syntactic representations of the same type; namely, distinguishing "unsigned" from "unsigned int" from "int unsigned" and the like.

What are the use cases for this, do you also need it to check adherence to coding standards ?

As yet another example, we would like to be able to distinguish between
void (*pf)(int left_is_array[], int* right_is_pointer);
vs
void (*pf)(int* left_is_pointer, int right_is_array[]);

[Note: since pf is a pointer to function, there are no parameters, hence we cannot keep track of the parameter names, nor can we retrieve the original type from OriginalParmVarDecl].

In our opinion, re-lexing the type specifier is not an option.

We will definitely handle this. The way I see it, you'll be able to get at the parameters through the type source info (and look into the parameter's type info too).

-Argiris

Argyrios Kyrtzidis wrote:

Hi Enea,

Hello.

Argyrios Kyrtzidis wrote:

Hi clang community,
I'd like to make a proposal for keeping source location information for types in the AST, to allow functionality like "find the source locations where this typedef is used".

[...snip...]

TypeSpecifier will replace the QualType fields in the Decls. For example, ValueDecl will contain a TypeSpecifier instead of a plain QualType.
"QualType ValueDecl::getType()" will delegate to TypeSpecifier::getType() (ValueDecl::getType() will still return a QualType).
TypeSpecifier will also be contained in Exprs that deal with type-specifiers, like SizeOfAlignOfExpr.

[...]

As clang users, we would be really interested in such a proposal (or a functionally equivalent one), and hence we would like to know whether there has been any progress or whether there are plans in this respect.

Yes, I quite recently started working on this feature.

This is really good news!

[...]

Another issue that could be solved by following this proposal is the ability to distinguish different syntactic representations of the same type; namely, distinguishing "unsigned" from "unsigned int" from "int unsigned" and the like.

What are the use cases for this, do you also need it to check adherence to coding standards ?

Yes. (In this particular case, I suspect that coding standard checkers are the only application.)

As yet another example, we would like to be able to distinguish between
void (*pf)(int left_is_array[], int* right_is_pointer);
vs
void (*pf)(int* left_is_pointer, int right_is_array[]);

[Note: since pf is a pointer to function, there are no parameters, hence we cannot keep track of the parameter names, nor can we retrieve the original type from OriginalParmVarDecl].

In our opinion, re-lexing the type specifier is not an option.

We will definitely handle this. The way I see it, you'll be able to get at the parameters through the type source info (and look into the parameter's type info too).

-Argiris

Well, many many thanks in advance.

Cheers,
Enea.

Is re-lexing the type specifier a reasonable solution for this ?
My concern is the impact on memory consumption for type source info. Ideally memory usage for type source info should be reasonable enough to consider having them on all the time.

-Argiris

Argyrios Kyrtzidis wrote:

Another issue that could be solved by following this proposal is the ability to distinguish different syntactic representations of the same type; namely, distinguishing "unsigned" from "unsigned int" from "int unsigned" and the like.

What are the use cases for this, do you also need it to check adherence to coding standards ?

Yes. (In this particular case, I suspect that coding standard checkers are the only application.)

Is re-lexing the type specifier a reasonable solution for this ?
My concern is the impact on memory consumption for type source info. Ideally memory usage for type source info should be reasonable enough to consider having them on all the time.

-Argiris

Let us see: according to the C standard, you can freely reorder type specifiers and mix them with type qualifiers (which can occur several times) and storage class specifiers (which can occur only once).
So, the worst cases should look like the following:

void foo() {
   /* ... */
   const long const int extern const unsigned volatile long *p1;

   const long const int const unsigned typedef volatile long * PType;
   const PType p2 = p1;

   p1 = (int long const unsigned volatile long * restrict) p2;
}
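
For comparison, a C++ variant of the same idea that can be checked at compile time. The repeated `const` of the C example is dropped here, on the understanding that C++, unlike C99, does not accept a qualifier repeated within one specifier sequence; the names `Scrambled`, `Tidy` and the constants are made up for the example.

```cpp
#include <type_traits>

// Type specifiers, qualifiers and 'typedef' mixed in an arbitrary order.
long const int unsigned typedef volatile long *PType;

using Scrambled = long const int unsigned volatile long;
using Tidy = const volatile unsigned long long;

// However the specifiers are ordered, the resulting type is the same,
// so the original token order is not recoverable from the type.
constexpr bool kSameScalar = std::is_same<Scrambled, Tidy>::value;
constexpr bool kSamePointer = std::is_same<PType, Tidy *>::value;

static_assert(kSameScalar, "specifier order does not change the type");
static_assert(kSamePointer, "'typedef' may appear mid-sequence");
```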

Well, things would be easy, provided we can see the sequence of preprocessed tokens. Because the tokens above may be the result of (several levels of) macro expansions ... here is where I am afraid I am missing knowledge about the underlying clang infrastructure, so I need some help.

Is it possible to start the re-lexing from the source location of a particular token (e.g., the TypeSpecStartLoc of the declaration of p1 above) so as to obtain the exact sequence of the preprocessed tokens?
Can you point me to some existing code snippet where such a re-lexing is done?

Thank you in advance for the help,
Enea.