Reading contents of Comments or any other Token

Hi all,

I love that clang and llvm make it possible for folks like me to create tools that have a deeper understanding of C, C++, and ObjC. For instance, I used clang to write a tool that helped me normalize headers when porting code across platforms. Clang and llvm were easy to get into and well documented. Thanks for the work you all do.

I have a question for you though.

Brief Summary of the rest of the email:
2 Questions

  1. Is there an intended way of getting the contents of a token from just the SourceRange?
  2. Should I be asking the SourceManager for a Buffer for a given FileID and then adjusting pointers into that buffer based on the Offset of that FileID?

More details
I was considering what it would take to build a doxygen-like tool using Clang and found the CommentHandler object and its virtual method, HandleComment(Preprocessor &, SourceRange). However, I was a bit stuck when I tried to go from the SourceRange to the actual contents of the comment.

I looked at what PPCallbacks and ASTConsumers offer but didn’t see anything that would lead me to believe I should’ve expected more data from the CommentHandler. I looked at old code I had written that used the Rewriter, but that didn’t feel right because I don’t want to rewrite the comment, I want to parse its contents. So then I started looking around at frontend actions that spit out data or modify the guts of a buffer. Eventually I stumbled across the HTMLPrintAction and its corresponding HTMLPrinter.

Inside of HTMLPrinter I noticed the AddLineNumbers method which was performing manipulations based on a raw MemoryBuffer which looked about right. It eventually led me to this prototype code:

/// Write out the entire comment based on the source range.
bool IndentingCommentHandler::HandleComment(Preprocessor &pp, SourceRange rng)
{
    FileID FID = pp.getSourceManager().getMainFileID();
    const llvm::MemoryBuffer *MB = pp.getSourceManager().getBuffer(FID);

    int size = rng.getEnd().getRawEncoding() - rng.getBegin().getRawEncoding();
    char *Buff = (char *)calloc(size+1, sizeof(char));

    const char *itBeg = MB->getBufferStart();
    const char *itEnd = MB->getBufferStart();

    unsigned int offset = pp.getSourceManager().getLocForStartOfFile(FID).getRawEncoding();

    // Adjust pointers to account for the FileID's offset in the Source manager.
    itBeg -= offset;
    itEnd -= offset;

    // Adjust pointers relative to where the comment actually begins and ends
    for( int i = 0; i < rng.getBegin().getRawEncoding(); i++)
    {
        ++itBeg;
        ++itEnd;
    }

    for( int i = 0; i < size; ++i)
    {
        ++itEnd;
    }

    std::copy(itBeg, itEnd, Buff);
    std::cout << "=============================" << std::endl;
    std::cout << Buff << std::endl;
    free(Buff);
    return false;
}

This seems to work for the few test cases I’ve tried it against, but it also feels a bit verbose. I’m wondering, did I do something stupid? Did I overlook a better or more proper way of doing this?

Many thanks for any help and guidance,
Larry Olson
(https://github.com/loarabia)

You should generally avoid using SourceLocation::getRawEncoding(). It is only useful as opaque data; do not use it for offset info.
Check out SourceManager::getDecomposedLoc(SourceLocation). It returns a FileID/offset pair, so you can arrive at a Buffer+offset for a SourceLocation.
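
To illustrate the idea without depending on the clang headers, here is a minimal, self-contained sketch. ToySourceManager, ToyFileID, ToyLoc, locFor, and textBetween are hypothetical stand-ins invented for this example, not clang classes; they only model how a location decomposes into a (file, offset) pair that can then index into that file's buffer. It assumes the range end points one past the last character, as the prototype above does, and it needs C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are NOT clang's real types.
struct ToyFileID { unsigned index; };
struct ToyLoc { unsigned raw; };  // opaque value, like a raw-encoded location

class ToySourceManager {
public:
  // Registers a buffer. Each file owns a disjoint range of raw location values.
  ToyFileID addFile(std::string text) {
    unsigned start = nextStart_;
    nextStart_ += static_cast<unsigned>(text.size()) + 1;
    files_.push_back({start, std::move(text)});
    return ToyFileID{static_cast<unsigned>(files_.size() - 1)};
  }

  // Builds the opaque location for a given offset inside a file.
  ToyLoc locFor(ToyFileID fid, unsigned offset) const {
    return ToyLoc{files_[fid.index].start + offset};
  }

  // Plays the role of getDecomposedLoc: opaque location -> (file, offset).
  std::pair<ToyFileID, unsigned> getDecomposedLoc(ToyLoc loc) const {
    for (std::size_t i = files_.size(); i-- > 0;) {
      if (loc.raw >= files_[i].start)
        return {ToyFileID{static_cast<unsigned>(i)}, loc.raw - files_[i].start};
    }
    assert(false && "invalid location");
    return {ToyFileID{0}, 0};
  }

  std::string_view getBufferData(ToyFileID fid) const {
    return files_[fid.index].text;
  }

private:
  struct Entry { unsigned start; std::string text; };
  std::vector<Entry> files_;
  unsigned nextStart_ = 1;  // 0 is reserved for "invalid"
};

// Returns the text in [begin, end); both locations must be in the same file.
std::string_view textBetween(const ToySourceManager &sm, ToyLoc begin, ToyLoc end) {
  auto [beginFile, beginOff] = sm.getDecomposedLoc(begin);
  auto [endFile, endOff] = sm.getDecomposedLoc(end);
  assert(beginFile.index == endFile.index && beginOff <= endOff);
  return sm.getBufferData(beginFile).substr(beginOff, endOff - beginOff);
}
```

In a real CommentHandler the same shape would presumably be: call getDecomposedLoc on rng.getBegin() and rng.getEnd(), then slice the buffer for the returned FileID between the two offsets. That also drops the assumption that the comment lives in the main file, since the FileID comes from the location itself.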

Thank you very much.
I looked at the implementation of getDecomposedLoc to make sure I understood what I was getting and then made changes to my sample based on what I saw. It seems to work beautifully and simplifies the code.
(It's funny -- I think I passed through that very code earlier today while tracking getOffset but overlooked its impact.)
Did anything else in my usage of clang seem off?
My thanks again,
Larry Olson