Start of source range for Decl nodes.

Hello.

In our application we are facing a problem related to the quality of
source location information for Decl nodes. When using method
    Decl::getSourceRange()
we noticed a few cases where we obtain not so accurate location info for
the start of the range.
Here are a few examples.

For DeclaratorDecl nodes:
  const int a;
  static int b;
  inline int foo() { return 5; }
the range starts from "int", rather than "const" or "static" or "inline".

For a NamespaceDecl node such as
  namespace N { }
the range starts from the identifier N, rather than from the "namespace"
keyword. Similarly for a linkage specification such as
  extern "C" { ... }

All of these minor problems can be easily fixed as soon as there will be
a place where the start for the range can be recorded. For instance,
when creating a new VarDecl node in SemaDecl.cpp:2956

    NewVD = VarDecl::Create(Context, DC, D.getIdentifierLoc(),
                            II, R, TInfo, SC, SCAsWritten);

the Declarator object D could provide accurate source range info, but
the VarDecl constructor has no parameter accepting the start of such a
range: it has a SourceLocation for the identifier loc and a
TypeSourceInfo, which does not take into account syntactic elements such
as storage classes, CVR qualifiers, inline specifiers, etc.
As a result, we obtain strange-looking differences whereby declaration
    int const a;
has an accurate source range, whereas
    const int a;
is skipping the const keyword.

Assuming that the goal of obtaining precise source location info should
be achieved by all means, now the question is: which is best place where
this additional source location (the start of the range) for the Decl
hierarchy should be stored?
Would it be OK to store them at the top of the hierarchy?
Are there better places?

Regards,
Enea.

I'd rather that we only add this information to those AST nodes that need it.

Also, the source range for a DeclaratorDecl (and anything else that stares with declaration specifiers) gets messy when there are multiple declarations, e.g.,

  static struct Point { int x, y; } p1, *p2, get_point(int x, int y);

Who claims the "static" keyword? Point? p1? Everyone?

If this occurred inside a DeclStmt, we'd have a DeclGroup to help us sort through this. At namespace scope, we don't have the equivalent notion.

  - Doug

Doug,

you may remember my request for something similar, via libclang. To be able to extract inline documentation from source code, I also need to find the source range not only for a particular declarator, but also the entire declaration. Right now there is no way to get to that via libclang, as declarations aren't represented by cursors.

Thanks,
         Stefan

Assuming that the goal of obtaining precise source location info should
be achieved by all means, now the question is: which is best place where
this additional source location (the start of the range) for the Decl
hierarchy should be stored?
Would it be OK to store them at the top of the hierarchy?
Are there better places?

I'd rather that we only add this information to those AST nodes that need it.

Also, the source range for a DeclaratorDecl (and anything else that stares with declaration specifiers) gets messy when there are multiple declarations, e.g.,

  static struct Point { int x, y; } p1, *p2, get_point(int x, int y);

Who claims the "static" keyword? Point? p1? Everyone?

If this occurred inside a DeclStmt, we'd have a DeclGroup to help us sort through this. At namespace scope, we don't have the equivalent notion.

Doug,

you may remember my request for something similar, via libclang.

Yes, this is a longstanding representational issue. Discussions of how to deal with it go back more than 2 years, but nobody's actually gone ahead and fixed the problems.

To be
able to extract inline documentation from source code, I also need to
find the source range not only for a particular declarator, but also the
entire declaration.

Nailing down what this actually means is fairly important for the discussion of representation. What's the source range for the entire declaration of "p2" in my example? Does it include the unrelated declarator p1? Does it include get_point?

Right now there is no way to get to that via
libclang, as declarations aren't represented by cursors.

The source range provided by libclang covers the declarator for each declaration, but not the specifiers common to all of the declarators.

  - Doug

Yes, I think it's best to stay close to the C++ grammar in that respect: The above contains a single (toplevel) declaration, with 4 (toplevel) declarators. There already is a VarDecl cursor for 'p2', whose extent is just the 'p2' token itself. The declaration this is part of then is the same for all four declarators, with the same extent, from 'static' to ';'.

Thanks,
         Stefan

This is what DeclStmts essentially do via the DeclGroup mechanism. If that notion were extended to work in namespace/class/etc. contexts, we could leverage it in libclang and elsewhere.

  - Doug

Hello.

In our application we are facing a problem related to the quality of
source location information for Decl nodes. When using method
   Decl::getSourceRange()
we noticed a few cases where we obtain not so accurate location info for
the start of the range.
Here are a few examples.

For DeclaratorDecl nodes:
const int a;
static int b;
inline int foo() { return 5; }
the range starts from "int", rather than "const" or "static" or "inline".

For a NamespaceDecl node such as
namespace N { }
the range starts from the identifier N, rather than from the "namespace"
keyword. Similarly for a linkage specification such as
extern "C" { ... }

All of these minor problems can be easily fixed as soon as there will be
a place where the start for the range can be recorded. For instance,
when creating a new VarDecl node in SemaDecl.cpp:2956

   NewVD = VarDecl::Create(Context, DC, D.getIdentifierLoc(),
                           II, R, TInfo, SC, SCAsWritten);

the Declarator object D could provide accurate source range info, but
the VarDecl constructor has no parameter accepting the start of such a
range: it has a SourceLocation for the identifier loc and a
TypeSourceInfo, which does not take into account syntactic elements such
as storage classes, CVR qualifiers, inline specifiers, etc.
As a result, we obtain strange-looking differences whereby declaration
   int const a;
has an accurate source range, whereas
   const int a;
is skipping the const keyword.

Assuming that the goal of obtaining precise source location info should
be achieved by all means, now the question is: which is best place where
this additional source location (the start of the range) for the Decl
hierarchy should be stored?
Would it be OK to store them at the top of the hierarchy?
Are there better places?

I'd rather that we only add this information to those AST nodes that need it.

I've proposed to put it in Decl as I'm unable to find an AST node that
has no need to store start source location...

I'm missing something?

Also, the source range for a DeclaratorDecl (and anything else that stares with declaration specifiers) gets messy when there are multiple declarations, e.g.,

  static struct Point { int x, y; } p1, *p2, get_point(int x, int y);

Who claims the "static" keyword? Point? p1? Everyone?

IMHO opinion there is no doubts that the full source range of p2 is from
"static" to "p2" and that the full source range of "get_point" is from
static to ")".

The concept underlying the source range can be easily defined as the
minimal contiguous textual area that includes all the text about the
declaration.

Of course if some library client wish to build a source range between
name and declaration end source location is free to do that.

This solves also easily the problem to know if two declaration are
inside the same group: if two consecutive declaration have same start
location they are inside the same group.

Hello.

In our application we are facing a problem related to the quality of
source location information for Decl nodes. When using method
  Decl::getSourceRange()
we noticed a few cases where we obtain not so accurate location info for
the start of the range.
Here are a few examples.

For DeclaratorDecl nodes:
const int a;
static int b;
inline int foo() { return 5; }
the range starts from "int", rather than "const" or "static" or "inline".

For a NamespaceDecl node such as
namespace N { }
the range starts from the identifier N, rather than from the "namespace"
keyword. Similarly for a linkage specification such as
extern "C" { ... }

All of these minor problems can be easily fixed as soon as there will be
a place where the start for the range can be recorded. For instance,
when creating a new VarDecl node in SemaDecl.cpp:2956

  NewVD = VarDecl::Create(Context, DC, D.getIdentifierLoc(),
                          II, R, TInfo, SC, SCAsWritten);

the Declarator object D could provide accurate source range info, but
the VarDecl constructor has no parameter accepting the start of such a
range: it has a SourceLocation for the identifier loc and a
TypeSourceInfo, which does not take into account syntactic elements such
as storage classes, CVR qualifiers, inline specifiers, etc.
As a result, we obtain strange-looking differences whereby declaration
  int const a;
has an accurate source range, whereas
  const int a;
is skipping the const keyword.

Assuming that the goal of obtaining precise source location info should
be achieved by all means, now the question is: which is best place where
this additional source location (the start of the range) for the Decl
hierarchy should be stored?
Would it be OK to store them at the top of the hierarchy?
Are there better places?

I'd rather that we only add this information to those AST nodes that need it.

I've proposed to put it in Decl as I'm unable to find an AST node that
has no need to store start source location...

I'm missing something?

I don't know without studying every Decl again,. If we're truly missing this information *everywhere*, then it's fine. However, please make sure that you're not introducing redundancy: for example, NamespaceAliasDecl already has its starting location in the member "NamespaceLoc". If we add a StartLoc to Decl, we need to remove NamespaceLoc from NamespaceAliasDecl.

Also, the source range for a DeclaratorDecl (and anything else that stares with declaration specifiers) gets messy when there are multiple declarations, e.g.,

static struct Point { int x, y; } p1, *p2, get_point(int x, int y);

Who claims the "static" keyword? Point? p1? Everyone?

IMHO opinion there is no doubts that the full source range of p2 is from
"static" to "p2" and that the full source range of "get_point" is from
static to ")".

The concept underlying the source range can be easily defined as the
minimal contiguous textual area that includes all the text about the
declaration.

Of course if some library client wish to build a source range between
name and declaration end source location is free to do that.

This solves also easily the problem to know if two declaration are
inside the same group: if two consecutive declaration have same start
location they are inside the same group.

Seems reasonable. So the starting location of Point is still 'static'?

  - Doug

Hello.

In our application we are facing a problem related to the quality of
source location information for Decl nodes. When using method
  Decl::getSourceRange()
we noticed a few cases where we obtain not so accurate location info for
the start of the range.
Here are a few examples.

For DeclaratorDecl nodes:
const int a;
static int b;
inline int foo() { return 5; }
the range starts from "int", rather than "const" or "static" or "inline".

For a NamespaceDecl node such as
namespace N { }
the range starts from the identifier N, rather than from the "namespace"
keyword. Similarly for a linkage specification such as
extern "C" { ... }

All of these minor problems can be easily fixed as soon as there will be
a place where the start for the range can be recorded. For instance,
when creating a new VarDecl node in SemaDecl.cpp:2956

  NewVD = VarDecl::Create(Context, DC, D.getIdentifierLoc(),
                          II, R, TInfo, SC, SCAsWritten);

the Declarator object D could provide accurate source range info, but
the VarDecl constructor has no parameter accepting the start of such a
range: it has a SourceLocation for the identifier loc and a
TypeSourceInfo, which does not take into account syntactic elements such
as storage classes, CVR qualifiers, inline specifiers, etc.
As a result, we obtain strange-looking differences whereby declaration
  int const a;
has an accurate source range, whereas
  const int a;
is skipping the const keyword.

Assuming that the goal of obtaining precise source location info should
be achieved by all means, now the question is: which is best place where
this additional source location (the start of the range) for the Decl
hierarchy should be stored?
Would it be OK to store them at the top of the hierarchy?
Are there better places?

I'd rather that we only add this information to those AST nodes that need it.

I've proposed to put it in Decl as I'm unable to find an AST node that
has no need to store start source location...

I'm missing something?

I don't know without studying every Decl again,. If we're truly missing this information *everywhere*, then it's fine. However, please make sure that you're not introducing redundancy: for example, NamespaceAliasDecl already has its starting location in the member "NamespaceLoc". If we add a StartLoc to Decl, we need to remove NamespaceLoc from NamespaceAliasDecl.

Yes, of course adding StartLocation to Decl we'll remove it from the
derived nodes.

Also, the source range for a DeclaratorDecl (and anything else that stares with declaration specifiers) gets messy when there are multiple declarations, e.g.,

static struct Point { int x, y; } p1, *p2, get_point(int x, int y);

Who claims the "static" keyword? Point? p1? Everyone?

IMHO opinion there is no doubts that the full source range of p2 is from
"static" to "p2" and that the full source range of "get_point" is from
static to ")".

The concept underlying the source range can be easily defined as the
minimal contiguous textual area that includes all the text about the
declaration.

Of course if some library client wish to build a source range between
name and declaration end source location is free to do that.

This solves also easily the problem to know if two declaration are
inside the same group: if two consecutive declaration have same start
location they are inside the same group.

Seems reasonable. So the starting location of Point is still 'static'?

No, the starting location of RecordDecl is on 'struct'.

Of course also the TagDecl with isEmbeddedInDeclarator flag set are in
the same decl group (here we does not need to check the start location).

I had completely forgotten about isEmbeddedInDeclarator. Okay, it sounds like adding start locations will be enough information to make this work. Great!
  - Doug