Preprocessed loc/token retrieval dream (almost) come true

Clang has always lacked a way to reconstruct the preprocessed
token stream from a given location (without redoing the full preprocessing).

Thanks to recent changes from Chandler and Argyrios I'm now able to get
the next parsed token location in a reliable way.

I attach the code I currently use, both for review and to check whether
there is interest in having these helpers in the clang library (IMHO this
service is *very* useful and is currently only badly approximated in
HTMLRewrite.cpp).

Using the code also exposes some likely bugs in how clang stores locations,
namely:

- the SLocEntry for a macro arg expansion has an extra token at the end, and
this is not taken into consideration when computing isInFileID (a
workaround for that is in the attached code)

- the immediate expansion range of stringified tokens encloses only '#' and
not '# arg' (which means the helper gets confused there)

- the immediate expansion range of concatenated tokens encloses only '##' and
not 'x ## y' (which means the helper gets confused there)
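
For readers less familiar with the two operators named in the last two
bullets, here is a minimal, self-contained sketch of what '#' and '##' do in a
macro body. The macro and function names are made up for this illustration.
The sketch shows only the source-level constructs, not how clang records
their expansion ranges.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper macros, named for this illustration only.
#define STRINGIFY(arg) #arg   // '# arg': the argument's spelling becomes a string literal
#define CONCAT(x, y) x ## y   // 'x ## y': two tokens are pasted into one

// Returns the string produced by stringifying the tokens a + b.
inline std::string stringified() { return STRINGIFY(a + b); }

// Returns the integer literal obtained by pasting 12 and 34 together.
inline int concatenated() { return CONCAT(12, 34); }
```

In both cases the resulting token comes from the whole construct ('# arg' or
'x ## y'), which is why a range covering only the operator is surprising.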

The code does not yet take into account file changes due to
#include, but I think this is a minor point and probably fixable.

To do its work, parser_loc_get_pp_next needs a reverse map to be loaded
first, so that it knows which tokens are expansion points (i.e. a
SourceLocation for each macro SLocEntry).

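#include <cassert>
#include <utility>
#include "clang/Basic/SourceManager.h"
#include "clang/Lex/Lexer.h"
#include "llvm/ADT/DenseMap.h"

// get_source_manager() and get_lang_options() are assumed to be accessors
// provided by the surrounding tool; they are not part of clang.

// Maps the raw encoding of an expansion point to its macro SLocEntry.
typedef llvm::DenseMap<unsigned, clang::SrcMgr::SLocEntry> Exp_Map;

Exp_Map exp_map;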

void load_exp_map() {
  using namespace clang;
  SourceManager& sm = get_source_manager();
  int i, last = sm.local_sloc_entry_size();
  for (i = 0; i < last; ++i) {
    FileID fid;
    // This method is private.
    // fid = FileID::get(i);
    // Ugly dirty trick is needed
    *reinterpret_cast<int*>(&fid) = i;
    SrcMgr::SLocEntry entry = sm.getSLocEntry(fid);
    if (!entry.isExpansion())
      continue;
    SourceLocation from = entry.getExpansion().getExpansionLocStart();
    exp_map[from.getRawEncoding()] = entry;
  }
}

clang::SourceLocation parser_loc_get_pp_next(clang::SourceLocation cur) {
  using namespace clang;
  SourceManager& sm = get_source_manager();
  const clang::LangOptions& lo = get_lang_options();
  assert(exp_map.find(cur.getRawEncoding()) == exp_map.end());
  SourceLocation next;
  while (1) {
    std::pair<FileID, unsigned> cur_info = sm.getDecomposedLoc(cur);
    SourceLocation scur = sm.getSpellingLoc(cur);
    std::pair<FileID, unsigned> scur_info = sm.getDecomposedLoc(scur);
    bool invalid = false;
    StringRef buf = sm.getBufferData(scur_info.first, &invalid);
    if (invalid)
      return SourceLocation();
    const char* point = buf.data() + scur_info.second;
    Lexer lexer(sm.getLocForStartOfFile(scur_info.first), lo,
                buf.begin(), point, buf.end());
    Token tok;
    lexer.LexFromRawLexer(tok);
    lexer.LexFromRawLexer(tok);
    if (tok.is(tok::eof)) {
      if (!cur.isMacroID())
        return SourceLocation();
    }
    else {
      SourceLocation snext = tok.getLocation();
      unsigned dist = sm.getFileOffset(snext) - scur_info.second;
      // Dirty trick to apply offset to macro loc
      next = SourceLocation::getFromRawEncoding(cur.getRawEncoding() + dist);
      // The following conditional is needed only to work around a
      // likely bug in SourceManager::isInFileID when called with macro arg
      // expansions.
      if (sm.isMacroArgExpansion(cur)) {
        // Dirty trick to apply offset (plus one) to macro loc
        SourceLocation probe =
            SourceLocation::getFromRawEncoding(cur.getRawEncoding() + dist + 1);
        if (sm.isInFileID(probe, cur_info.first))
          break;
      }
      else {
        if (sm.isInFileID(next, cur_info.first))
          break;
      }
    }
    cur = sm.getImmediateExpansionRange(cur).second;
  }
  while (1) {
    Exp_Map::iterator i = exp_map.find(next.getRawEncoding());
    if (i == exp_map.end())
      break;
    SrcMgr::SLocEntry entry = i->second;
    // This method is private.
    // next = SourceLocation::getMacroLoc(entry.getOffset());
    // Ugly dirty trick is needed
    // (1U, not 1: shifting a signed 1 into the sign bit is not portable)
    next = SourceLocation::getFromRawEncoding(entry.getOffset() | (1U << 31));
    assert(next.isMacroID());
  }
  return next;
}
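
The second loop above can be hard to picture without a clang build at hand.
The following is a toy model of it under stated assumptions: locations are
plain 32-bit integers, the top bit marks a macro location (mirroring the
"| (1 << 31)" trick), and the reverse map stores only the offset of each
expansion entry. All names (RawLoc, ToyExpMap, enter_expansions, ...) are
hypothetical and exist only in this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy stand-in for a SourceLocation raw encoding (assumption of this sketch).
using RawLoc = std::uint32_t;
constexpr RawLoc kMacroBit = RawLoc{1} << 31;

inline bool is_macro_loc(RawLoc loc) { return (loc & kMacroBit) != 0; }
inline RawLoc make_macro_loc(std::uint32_t offset) { return offset | kMacroBit; }

// Toy reverse map: raw encoding of an expansion point -> offset of the
// expansion entry created for it.
using ToyExpMap = std::unordered_map<RawLoc, std::uint32_t>;

// Follows expansion points until reaching a location that is not one,
// i.e. descends into nested expansions as the second loop above does.
inline RawLoc enter_expansions(const ToyExpMap& map, RawLoc next) {
  for (auto it = map.find(next); it != map.end(); it = map.find(next))
    next = make_macro_loc(it->second);
  return next;
}
```

With a map holding 100 -> 5000 and (5000 with the macro bit set) -> 6000,
starting from 100 ends at offset 6000 with the macro bit set, while a location
that is not an expansion point is returned unchanged.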

Ping and direct questions below.

Is there interest in having, in the clang library, methods that return,
from a starting location, the locations of all the following tokens in
preprocessing order? This would make it possible to know whether *all* the
locations in a specific range satisfy a given property, to get the missing
locations, to scan the exact preprocessed sequence of type/storage
specifiers, etc.

Using the code also exposes some likely bugs in how clang stores locations,
namely:

- the SLocEntry for a macro arg expansion has an extra token at the end, and
this is not taken into consideration when computing isInFileID (a
workaround for that is in the attached code)

Is this intended, or should it be considered a bug?

- the immediate expansion range of stringified tokens encloses only '#' and
not '# arg' (which means the helper gets confused there)

Is this intended, or should it be considered a bug?

- the immediate expansion range of concatenated tokens encloses only '##' and
not 'x ## y' (which means the helper gets confused there)

Is this intended, or should it be considered a bug?

Hi Abramo,

Sorry to disappoint you but I think the dream remains unfulfilled :wink:

The code that you posted was a bit hard to follow, but correct me if I'm wrong:
you are recording all macro expansion points and, once you hit one, you enter the SLocEntry for the macro expansion and start lexing it. Is this correct?

This may seem to work but it is not reliable. The main issue is that for macro argument expansions we do *not* guarantee that the range of the SLocEntry contains only the tokens that were actually lexed.
This is because we aggressively "merge" them to reduce the number of SLocEntries needed.

Here's an example:

#define M1 1
#define M2 2
#define M3 3

#define MA1(a,b,c) a c
#define MA2(x) x

MA2( MA1(M1, M2, M3) )

The tokens that MA2 ultimately receives are '1' and '3' but if you follow through and lex the SLocEntry that gets created for the macro arg expansion for MA2, you will notice that the length is 5 and it is actually a chunk encompassing "1 2 3".

So, from this chunk, only '1' and '3' and their respective locations were actually passed to the parser but you don't know that just by looking at the SLocEntry.
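
To see which tokens survive the expansion without looking at any SLocEntry,
one can stringify the result. This is only a sketch: SHOW, SHOW_ and
ma2_result are hypothetical helpers added for the illustration, and the two
macro levels are there so that the argument is fully expanded before it is
stringified.

```cpp
#include <cassert>
#include <string>

#define M1 1
#define M2 2
#define M3 3

#define MA1(a,b,c) a c
#define MA2(x) x

// Hypothetical helpers for this sketch only.
#define SHOW_(x) #x
#define SHOW(x) SHOW_(x)

// Returns the spelling of the tokens that MA2 ultimately yields.
inline std::string ma2_result() { return SHOW(MA2( MA1(M1, M2, M3) )); }
```

The returned string contains '1' and '3' but not '2', even though '2' sits
inside the merged "1 2 3" chunk described above.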

Apart from that, this is trying to deal with macro expansions; how are you handling preprocessor directives? E.g.:

X
#if ...
Y
#else
X
#endif

How do you find out what comes after 'X' if you don't preprocess?

Using the code also exposes some likely bugs in how clang stores locations,
namely:

- the SLocEntry for a macro arg expansion has an extra token at the end, and
this is not taken into consideration when computing isInFileID (a
workaround for that is in the attached code)

Is this intended, or should it be considered a bug?

No, this is the nature of SLocEntry; it is not reliable for finding out the preprocessed tokens.

- the immediate expansion range of stringified tokens encloses only '#' and
not '# arg' (which means the helper gets confused there)

Is this intended, or should it be considered a bug?

This is reasonable and a good idea.

- the immediate expansion range of concatenated tokens encloses only '##' and
not 'x ## y' (which means the helper gets confused there)

Is this intended, or should it be considered a bug?

As is this.

-Argyrios

Hi Abramo,

Sorry to disappoint you but I think the dream remains unfulfilled :wink:

You made me sad for a few minutes... but let's try to find a solution: I
think that getting the preprocessed tokens has too many benefits to stop
only a few steps short of accomplishing it.

Let me know if you don't see strong benefits in being able to get
the preprocessed tokens in a range.

First the easy part:

Apart from that, this is trying to deal with macro expansions; how are you handling preprocessor directives? E.g.:

X
#if ...
Y
#else
X
#endif

How do you find out what comes after 'X' if you don't preprocess?

Preprocessor callbacks give us complete info about skipped areas, so the
helper just has to take that into account.

The same is true for file changes:

X
#include "..."
Y
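
As a rough sketch of how the helper could take skipped areas into account,
assume the callback reports each skipped region as a half-open range of byte
offsets in one file, and that these regions never overlap. The SkippedRanges
class and its methods are hypothetical and not part of clang.

```cpp
#include <cassert>
#include <map>

// Hypothetical bookkeeping for regions the preprocessor skipped, expressed
// as disjoint half-open [begin, end) byte offsets into a single file.
class SkippedRanges {
public:
  // Records one skipped region, e.g. as reported by a preprocessor callback.
  void add(unsigned begin, unsigned end) {
    if (begin < end)
      ranges_[begin] = end;
  }

  // Returns the first offset >= pos that is not inside a skipped region.
  unsigned skip_forward(unsigned pos) const {
    for (;;) {
      auto it = ranges_.upper_bound(pos);  // first range starting after pos
      if (it == ranges_.begin())
        return pos;                        // no range starts at or before pos
      --it;                                // last range starting at or before pos
      if (pos >= it->second)
        return pos;                        // pos lies past that range
      pos = it->second;                    // jump to its end and re-check
    }
  }

private:
  std::map<unsigned, unsigned> ranges_;    // begin -> end
};
```

Before lexing the next token, the helper would pass the candidate offset
through skip_forward, so that the '#if' branch that was not taken is never
lexed. File changes due to #include would need similar bookkeeping.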

The code that you posted was a bit hard to follow, but correct me if I'm wrong:
you are recording all macro expansion points and, once you hit one, you enter the SLocEntry for the macro expansion and start lexing it. Is this correct?

Yes, and my tests show that it works very well in most cases.

This may seem to work but it is not reliable. The main issue is that for macro arguments expansion we do *not* guarantee that the range of the SLocEntry contains only the tokens that were actually lexed.
This is because we aggressively "merge" them to reduce the number of needed SLocEntries.

How can I disable that "optimization", so as to verify its real memory
impact on some large and relevant test cases?

Many thanks for your help and your review.

Abramo

How about investigating whether it is possible/viable to extend the preprocessor callbacks so that you can get at the macro-expanded locations/tokens in a reliable way?

How can I disable that "optimization", so as to verify its real memory
impact on some large and relevant test cases?

See TokenLexer::updateLocForMacroArgTokens.

FYI, dropping the optimization and trying to guarantee that SLocEntries contain only lexed tokens has very little chance of getting into trunk; see the wins from the optimization here: http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20110822/045495.html

One way might be to call a post-expansion callback, passing it the expanded
tokens, but I'm not sure this would be accepted in trunk...

Do you have better ideas?

I think it is reasonable if it is opt-in, that is, if the preprocessor callback implementer indicates whether or not it wants the preprocessor to provide such info.
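
One possible shape for such an opt-in callback is sketched below. Everything
here is hypothetical: ExpansionCallbacks, wantsExpandedTokens, macroExpanded
and notifyExpansion are invented for the illustration and are not existing
clang APIs, and tokens are modelled as plain strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical callback interface; none of these names exist in clang.
struct ExpansionCallbacks {
  virtual ~ExpansionCallbacks() = default;
  // Opt-in switch: return true to receive the expanded tokens.
  virtual bool wantsExpandedTokens() const { return false; }
  // Called after a macro expansion, only for implementers that opted in.
  virtual void macroExpanded(const std::string& /*name*/,
                             const std::vector<std::string>& /*tokens*/) {}
};

// Toy stand-in for the preprocessor side: the token list is handed over
// only when the client asked for it. Returns true if it was delivered.
inline bool notifyExpansion(ExpansionCallbacks& cb, const std::string& name,
                            const std::vector<std::string>& tokens) {
  if (!cb.wantsExpandedTokens())
    return false;
  cb.macroExpanded(name, tokens);
  return true;
}

// Example client that opts in and records what it is given.
struct RecordingCallbacks : ExpansionCallbacks {
  std::vector<std::string> seen;
  bool wantsExpandedTokens() const override { return true; }
  void macroExpanded(const std::string& /*name*/,
                     const std::vector<std::string>& tokens) override {
    seen.insert(seen.end(), tokens.begin(), tokens.end());
  }
};
```

Clients that do not override wantsExpandedTokens pay nothing extra, which is
the point of making the feature opt-in.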