Patch to allow comment translators implementation

This small patch change comments handler in a simple way to permit to
implement quite easily comment translators.

Once applied this patch, a CommentHandler is allowed to build a first
token to be returned to Lexer and to push a TokenStream for the others,
then allowing generic comment -> tokens transformer.

This can be useful to transform comment shaped program annotation that
should be translated to source code and also other interesting applications.

Please let me know if it can be applied as it is or if I can improve it
somehow.

HandleComment.patch (4.71 KB)

This is an interesting approach. The only major concern I have is that this only allows you to translate comments into exactly one token. In the case of openmp pragmas (for example) this doesn't seem rich enough. A different approach would be to allow the handler to push an arbitrary number of tokens into the parser's lookahead buffer. Would this work for what you're trying to do?

-Chris

Yes, but perhaps this is not needed: as I wrote the CommentHandler could
return a first token *and* produce the other tokens to be read and push
them to lexer stack using EnterTokenStream.

I've already tried this with success in a sample implementation that
simply lex the comment content without modify it:

bool Comment_Converter::HandleComment(clang::Preprocessor &PP,
                                      clang::Token& token,
                                      clang::SourceRange Comment) {
  const clang::SourceManager &sm = PP.getSourceManager();
  const clang::LangOptions &lo = PP.getLangOptions();
  clang::SourceLocation begin = Comment.getBegin();
  clang::FileID fid = sm.getFileID(begin);
  const char* start = sm.getCharacterData(begin);
  const char* end = sm.getCharacterData(Comment.getEnd());
  if (start[1] == '*')
    end -= 2;
  start += 2;
  char saved = *end;
  *const_cast<char*>(end) = 0;
  clang::Lexer lexer(sm.getLocForStartOfFile(fid), lo,
                     sm.getBufferData(fid).first,
                     start, end);
  clang::Token tok;
  lexer.LexFromRawLexer(tok);
  if (tok.is(clang::tok::eof)) {
    *const_cast<char*>(end) = 0;
    return false;
  }
  token = tok;
  static std::vector<clang::Token> tokens;
  tokens.clear();
  while (1) {
    lexer.LexFromRawLexer(tok);
    if (tok.is(clang::tok::eof))
      break;
    if (tok.is(clang::tok::identifier))
      tok.setKind(PP.LookUpIdentifierInfo(tok)->getTokenID());
    tokens.push_back(tok);
  }
  *const_cast<char*>(end) = saved;
  if (tokens.size() > 0)
    PP.EnterTokenStream(tokens.data(), tokens.size(), false, false);
  return true;
}

This small patch change comments handler in a simple way to permit to
implement quite easily comment translators.

Once applied this patch, a CommentHandler is allowed to build a first
token to be returned to Lexer and to push a TokenStream for the others,
then allowing generic comment -> tokens transformer.

This can be useful to transform comment shaped program annotation that
should be translated to source code and also other interesting applications.

This is an interesting approach. The only major concern I have is
that this only allows you to translate comments into exactly one
token. In the case of openmp pragmas (for example) this doesn't seem

Do I've been sufficient clear explaining that the comments may be
translated to an arbitrary number of tokens calling EnterTokenStream
inside the CommentHandler?

rich enough. A different approach would be to allow the handler to
push an arbitrary number of tokens into the parser's lookahead
buffer. Would this work for what you're trying to do?

Yes, but perhaps this is not needed: as I wrote the CommentHandler could
return a first token *and* produce the other tokens to be read and push
them to lexer stack using EnterTokenStream.

I've already tried this with success in a sample implementation that
simply lex the comment content without modify it:

Still I've not got any feedback: do you think that the patch in original
mail will be applied as is? Should I improve it in some way?

This small patch change comments handler in a simple way to permit to
implement quite easily comment translators.

Once applied this patch, a CommentHandler is allowed to build a first
token to be returned to Lexer and to push a TokenStream for the others,
then allowing generic comment -> tokens transformer.

This can be useful to transform comment shaped program annotation that
should be translated to source code and also other interesting applications.

This is an interesting approach. The only major concern I have is
that this only allows you to translate comments into exactly one
token. In the case of openmp pragmas (for example) this doesn't seem

Do I've been sufficient clear explaining that the comments may be
translated to an arbitrary number of tokens calling EnterTokenStream
inside the CommentHandler?

Yes, but I find the protocol for introducing tokens via a comment handler to be very confusing. Could we instead eliminate the Token &token argument, and just make the protocol: to "parse" the contents of the comment, use EnterTokenStream and then return true?

Or, at the very least, the "token" argument should be named "firstToken", to indicate that it is possible to inject other tokens. Of course, HandleComment also needs documentation to describe what the parameters and return value actually mean, and how comment handlers can introduce tokens into the stream.

rich enough. A different approach would be to allow the handler to
push an arbitrary number of tokens into the parser's lookahead
buffer. Would this work for what you're trying to do?

Yes, but perhaps this is not needed: as I wrote the CommentHandler could
return a first token *and* produce the other tokens to be read and push
them to lexer stack using EnterTokenStream.

I've already tried this with success in a sample implementation that
simply lex the comment content without modify it:

Still I've not got any feedback: do you think that the patch in original
mail will be applied as is? Should I improve it in some way?

I think it's okay if the HandleComment protocol can be simplified a bit and if it is documented, although I'd like to hear from Chris. I'd feel much better if we actually had some kind of use of this code path within Clang itself. For example, would it be possible for the keep-comments mode to be implemented outside of the lexer using your changes to HandleComment? That might actually simplify the lexer while making it more general.

  - Doug

This small patch change comments handler in a simple way to permit to
implement quite easily comment translators.

Once applied this patch, a CommentHandler is allowed to build a first
token to be returned to Lexer and to push a TokenStream for the
others,
then allowing generic comment -> tokens transformer.

This can be useful to transform comment shaped program annotation that
should be translated to source code and also other interesting
applications.

This is an interesting approach. The only major concern I have is
that this only allows you to translate comments into exactly one
token. In the case of openmp pragmas (for example) this doesn't seem

Do I've been sufficient clear explaining that the comments may be
translated to an arbitrary number of tokens calling EnterTokenStream
inside the CommentHandler?

Yes, but I find the protocol for introducing tokens via a comment
handler to be very confusing. Could we instead eliminate the Token
&token argument, and just make the protocol: to "parse" the contents of
the comment, use EnterTokenStream and then return true?

The only proposal I see that would not change a lot of things is to
remove Token argument from CommentHandler and to specify that they
always shall use EnterTokenStream also when they need to insert only one
token.
Then at end of Preprocessor::HandleComment if any of the comment
handlers has informed the caller to have inserted some tokens the first
one available is fetched and returned by Preprocessor::HandleComment.

However the presence of argument for returned Token is perfectly
congruent with each other Handler.

Or, at the very least, the "token" argument should be named
"firstToken", to indicate that it is possible to inject other tokens. Of
course, HandleComment also needs documentation to describe what the
parameters and return value actually mean, and how comment handlers can
introduce tokens into the stream.

Yes, I will do all that before to submit final version of the patch and
once we agreed on the choosen approach.

rich enough. A different approach would be to allow the handler to
push an arbitrary number of tokens into the parser's lookahead
buffer. Would this work for what you're trying to do?

Yes, but perhaps this is not needed: as I wrote the CommentHandler could
return a first token *and* produce the other tokens to be read and push
them to lexer stack using EnterTokenStream.

I've already tried this with success in a sample implementation that
simply lex the comment content without modify it:

Still I've not got any feedback: do you think that the patch in original
mail will be applied as is? Should I improve it in some way?

I think it's okay if the HandleComment protocol can be simplified a bit
and if it is documented, although I'd like to hear from Chris. I'd feel
much better if we actually had some kind of use of this code path within
Clang itself. For example, would it be possible for the keep-comments
mode to be implemented outside of the lexer using your changes to
HandleComment? That might actually simplify the lexer while making it
more general.

... hmmm, probably it's feasible, but I'm not so sure it's a good idea
because this means we should always have a CommentHandler loaded and we
have the problem of CommentHandler execution order, as only the first
CommentHandler that want to return a first Token is allowed to do that
(unless we implement the variant above).

With proposed patch, instead, the CommentHandler that want to transform
comments has always the priority over keep-comments mode.

That apart I'd like to proceed by steps, what do you think about a
resubmittal of proposed patch with Token argument renamed in FirstToken
and proper documentation of HandleComment protocol?

Or you prefer I remove Token argument as described in the variant above
although it's a bit less efficient for returning a single token?

Yes, but I find the protocol for introducing tokens via a comment
handler to be very confusing. Could we instead eliminate the Token
&token argument, and just make the protocol: to "parse" the contents of
the comment, use EnterTokenStream and then return true?

The only proposal I see that would not change a lot of things is to
remove Token argument from CommentHandler and to specify that they
always shall use EnterTokenStream also when they need to insert only one
token.
Then at end of Preprocessor::HandleComment if any of the comment
handlers has informed the caller to have inserted some tokens the first
one available is fetched and returned by Preprocessor::HandleComment.

Yes, I think this is a good plan.

I think it's okay if the HandleComment protocol can be simplified a bit
and if it is documented, although I'd like to hear from Chris. I'd feel
much better if we actually had some kind of use of this code path within
Clang itself. For example, would it be possible for the keep-comments
mode to be implemented outside of the lexer using your changes to
HandleComment? That might actually simplify the lexer while making it
more general.

... hmmm, probably it's feasible, but I'm not so sure it's a good idea
because this means we should always have a CommentHandler loaded and we
have the problem of CommentHandler execution order, as only the first
CommentHandler that want to return a first Token is allowed to do that
(unless we implement the variant above).

I don't think it's necessary to reimplement -C. This functionality would be useful to implement the OpenMP pragma language, for example. What do you plan to use this for? Would any of it be coming back to mainline at some point?

-Chris

The only proposal I see that would not change a lot of things is to
remove Token argument from CommentHandler and to specify that they
always shall use EnterTokenStream also when they need to insert only one
token.
Then at end of Preprocessor::HandleComment if any of the comment
handlers has informed the caller to have inserted some tokens the first
one available is fetched and returned by Preprocessor::HandleComment.

Yes, I think this is a good plan.

You find attached a revised patch (based on r93512) that work as designed.

The patch passes all clang tests.

I think it's okay if the HandleComment protocol can be simplified a bit
and if it is documented, although I'd like to hear from Chris. I'd feel
much better if we actually had some kind of use of this code path within
Clang itself. For example, would it be possible for the keep-comments
mode to be implemented outside of the lexer using your changes to
HandleComment? That might actually simplify the lexer while making it
more general.

... hmmm, probably it's feasible, but I'm not so sure it's a good idea
because this means we should always have a CommentHandler loaded and we
have the problem of CommentHandler execution order, as only the first
CommentHandler that want to return a first Token is allowed to do that
(unless we implement the variant above).

I don't think it's necessary to reimplement -C. This functionality
would be useful to implement the OpenMP pragma language, for example.
What do you plan to use this for? Would any of it be coming back to
mainline at some point?

We intend to have some specially formed comments that could be
translated on demand into tokens to be parsed together with original
source while being transparent to each other compiler.

This could be used for program annotations, constraints for programming
by contract, instrumentation directives, special purpose assertions or
other things.

Although I think that the availability of this service in the clang
library is indeed valuable, I currently don't see a specific application
of this service suitable for current clang mainline.

HandleComment.patch (6.71 KB)

The only proposal I see that would not change a lot of things is to
remove Token argument from CommentHandler and to specify that they
always shall use EnterTokenStream also when they need to insert only one
token.
Then at end of Preprocessor::HandleComment if any of the comment
handlers has informed the caller to have inserted some tokens the first
one available is fetched and returned by Preprocessor::HandleComment.

Yes, I think this is a good plan.

You find attached a revised patch (based on r93512) that work as designed.

The patch passes all clang tests.

Looks fine to me. Chris?

   - Doug

Looks good to me, committed in r93795. I didn't commit the lib/Lex/PPDirectives.cpp hunk, because it didn't look necessary. Please correct me if I'm wrong.

-Chris

This part of the patch is strictly related to the others: the way for
CommentHandler to push tokens is to call EnterTokenStream and this
method add a TokenLexer (that was ignored by SkipExcludedConditionalBlock)

Without this part of the patch the pushed tokens related to comments
found inside skipped conditional block are not read until return from
Preprocessor::SkipExcludedConditionalBlock and this is clearly unwanted.

I don't see that. According to C, the contents of #if 0 blocks are *completely* skipped and may not even be syntactically valid. Using openmp as an example, it would be perfectly find to have completely malformed openmp comments in #if 0 blocks.

#if 0 skipping is also performance sensitive.

-Chris

Looks fine to me. Chris?

Looks good to me, committed in r93795. I didn't commit the
lib/Lex/PPDirectives.cpp hunk, because it didn't look necessary. Please
correct me if I'm wrong.

This part of the patch is strictly related to the others: the way for
CommentHandler to push tokens is to call EnterTokenStream and this
method add a TokenLexer (that was ignored by SkipExcludedConditionalBlock)

Without this part of the patch the pushed tokens related to comments
found inside skipped conditional block are not read until return from
Preprocessor::SkipExcludedConditionalBlock and this is clearly unwanted.

I don't see that. According to C, the contents of #if 0 blocks are
*completely* skipped and may not even be syntactically valid. Using
openmp as an example, it would be perfectly find to have completely
malformed openmp comments in #if 0 blocks.

Perhaps I've explained badly, please follow this example:

#if 0
/* openmp comments */
int a;
#endif

The preprocessor see the #if 0 directive and then enter
SkipExcludedConditionalBlock where the lexing proceed.

When we lex the comment the CommentHandler is invoked and it push some
tokens using EnterTokenStream (i.e. setting CurTokenLexer), but none of
this tokens is read by SkipExcludedConditionalBlock (without the patch
you've discarded).

This means that when SkipExcludedConditionalBlock returns and normal
lexing proceed the Preprocessor will read the tokens previously inserted
in the bad context i.e. after the #endif.

#if 0 skipping is also performance sensitive.

I hardly believe that between:

    if (CurLexer)
      CurLexer->Lex(Tok);
    else
      CurPTHLexer->Lex(Tok);

and

  void Lex(Token &Result) {
    if (CurLexer)
      CurLexer->Lex(Result);
    else if (CurPTHLexer)
      CurPTHLexer->Lex(Result);
    else if (CurTokenLexer)
      CurTokenLexer->Lex(Result);
    else
      CachingLex(Result);
  }

there is a performance difference... I'm missing something?

There are a few alternatives wrt to apply the missing part of the
original patch:

1) do not call comment handlers when skipping excluded conditional block

2) have a way inside comment handlers to know that we are inside an
excluded conditional block (so to avoid to call EnterTokenStream)

I think that 2) is better as it is more flexyble.

There are a few alternatives wrt to apply the missing part of the
original patch:

1) do not call comment handlers when skipping excluded conditional block

2) have a way inside comment handlers to know that we are inside an
excluded conditional block (so to avoid to call EnterTokenStream)

I think that 2) is better as it is more flexyble.

Perhaps I've explained badly, please follow this example:

#if 0
/* openmp comments */
int a;
#endif

The preprocessor see the #if 0 directive and then enter
SkipExcludedConditionalBlock where the lexing proceed.

When we lex the comment the CommentHandler is invoked and it push some
tokens using EnterTokenStream (i.e. setting CurTokenLexer), but none of
this tokens is read by SkipExcludedConditionalBlock (without the patch
you've discarded).

This means that when SkipExcludedConditionalBlock returns and normal
lexing proceed the Preprocessor will read the tokens previously inserted
in the bad context i.e. after the #endif.

#if 0 skipping is also performance sensitive.

There are a few alternatives wrt to apply the missing part of the
original patch:

1) do not call comment handlers when skipping excluded conditional block

2) have a way inside comment handlers to know that we are inside an
excluded conditional block (so to avoid to call EnterTokenStream)

I think that 2) is better as it is more flexyble.

I've not yet heard anything from you, what do you think is the proper
way to fix the problem above?

Hi Abramo,

I'm sorry for the delay. I think that #1 is the best approach. Comment handles should not be invoked on things in #if 0 blocks, because #if 0 blocks should be completely skipped according to the rules of the preprocessor: not doing this would invalidate the "include once" optimization that is applied to files wrapped in the "#ifndef FOO_H / #define FOO_H ... #endif" idiom.

In short, I think that #1 is the right fix.

-Chris

1) do not call comment handlers when skipping excluded conditional block

2) have a way inside comment handlers to know that we are inside an
excluded conditional block (so to avoid to call EnterTokenStream)

I think that 2) is better as it is more flexyble.

I've not yet heard anything from you, what do you think is the proper
way to fix the problem above?

Hi Abramo,

I'm sorry for the delay. I think that #1 is the best approach. Comment handles should not be invoked on things in #if 0 blocks, because #if 0 blocks should be completely skipped according to the rules of the preprocessor: not doing this would invalidate the "include once" optimization that is applied to files wrapped in the "#ifndef FOO_H / #define FOO_H ... #endif" idiom.

In short, I think that #1 is the right fix.

I've attached a (trivial) patch to do that.

However in his message, some minutes ago Douglas wrote:

We have some alternative way to cope with this problem:
>
> 1) make SkipExcludedConditionalBlock to consume also the tokens from
> CurTokenLexer

This seems like the right approach to me.

This was my original choice that Chris thought was inappropriate and he
removed that part of my patch in his commit.

Personally I've no strong preference between a choice or the other,
however I resume here the benefits I see for the two approaches:

Chris method (avoid to call comment handler when skipping):

- performance (we don't have to check from which lexer we are currently
reading the tokens in SkipExcludedConditionalBlock)

Douglas method (make SkipExcludedConditionalBlock to consume also the
tokens from CurTokenLexer):

- possibility to exit from excluded conditional block with tokens
inserted by comment handlers (with Chris method it's impossible)

HandleComment.patch (3.44 KB)

1) do not call comment handlers when skipping excluded conditional block

2) have a way inside comment handlers to know that we are inside an
excluded conditional block (so to avoid to call EnterTokenStream)

I think that 2) is better as it is more flexyble.

I've not yet heard anything from you, what do you think is the proper
way to fix the problem above?

Hi Abramo,

I'm sorry for the delay. I think that #1 is the best approach. Comment handles should not be invoked on things in #if 0 blocks, because #if 0 blocks should be completely skipped according to the rules of the preprocessor: not doing this would invalidate the "include once" optimization that is applied to files wrapped in the "#ifndef FOO_H / #define FOO_H ... #endif" idiom.

In short, I think that #1 is the right fix.

I've attached a (trivial) patch to do that.

However in his message, some minutes ago Douglas wrote:

We have some alternative way to cope with this problem:
>
> 1) make SkipExcludedConditionalBlock to consume also the tokens from
> CurTokenLexer

This seems like the right approach to me.

This was my original choice that Chris thought was inappropriate and he
removed that part of my patch in his commit.

Personally I've no strong preference between a choice or the other,
however I resume here the benefits I see for the two approaches:

Chris method (avoid to call comment handler when skipping):

- performance (we don't have to check from which lexer we are currently
reading the tokens in SkipExcludedConditionalBlock)

Douglas method (make SkipExcludedConditionalBlock to consume also the
tokens from CurTokenLexer):

- possibility to exit from excluded conditional block with tokens
inserted by comment handlers (with Chris method it's impossible)

HandleComment.patch (3.44 KB)

Chris has far more expertise in the preprocessor than I; go with the Chris method. (I missed his e-mail).

  - Doug

Thanks Abramo,

Does it work to use a simpler patch that does something like this in both places that currently call HandleComment:

    // Found but did not consume the newline.
- if (PP && PP->HandleComment(Result,
- SourceRange(getSourceLocation(BufferPtr),