Debugging Information; exploiting comments

Hi Cedric,

This is really a separate thread, so I'm forwarding this to the cfe-dev list.

Everyone: please feel free to chime in.

I am investigating adding debugging information to clang. Do you think it is too soon? (I would like to add a -g flag to the drivers and add conditional llvm debug intrinsic emission in CodeGen.) Do you already have some ideas on the question? (I studied llvm-gcc debug generation and would add something similar.)

I'm not currently involved with the efforts on the CodeGen module. Chris, Devang?

Another idea I had is implementing a documentation tool (like doxygen) using clang. The problem is that the existing framework doesn't permit analyzing the comments and parsing in the same pass (at least I don't see how).

You are indeed correct; currently the parser discards comments when the ASTs are built. We have discussed the technical hurdles of doing "appropriate" handling of comments, as they could be used for doxygen-like tools, as well as for annotations that can be used by other analysis tools.

The main challenge is that comments can appear literally anywhere, and how they conceptually bind to entities in the program (be they declarations or actual statements and expressions) is really specific to the application that uses the comments (e.g. doxygen).

I would need to add some sort of callback in the lexer or preprocessor for processing the comment (we can't parse the comment token, it would be an impossible task). The callback would store the comment, and when the next declaration is parsed, the stored comment is used for decorating the declaration. Do you think this is a good way?

I'm not entirely certain how comments are processed by the lexer and parser, and how easy it would be to add a callback. I believe that it is doable, but I haven't really looked at that code. Steve, Chris?

Conceptually, if a callback mechanism for parsing comments is in place you could then do whatever you wanted with the comments, although it wouldn't necessarily be easy (it would depend on your application). The ideal solution would be to separate the policy of how comments are used (e.g. how you bind them to expressions, statements, declarations, and so forth) from how they are parsed (or rather, how the ASTs are built in Sema). That way a bunch of tools that process comments could be built instead of a single ad hoc solution. We also don't want to get into the business of people unnecessarily hacking on the Sema module where the ASTs are built and semantically analyzed. Such hacks would inevitably cause the tools built on them to diverge from the functionality available in "mainline" clang.
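To make the callback idea concrete, here is a minimal sketch of a comment scanner that hands each comment to a caller-supplied function and leaves the policy to that caller. It is a toy scanner, not clang's lexer: the names CommentCallback, ScanComments, CollectedComment and CollectComments are made up for illustration, and string literals are not handled.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback type: receives the comment text and the 1-based
// line on which the comment starts.
using CommentCallback =
    std::function<void(const std::string &Text, unsigned Line)>;

// Toy scanner (an assumption for illustration, not clang's Lexer): finds
// `//` and `/* */` comments in Source and reports each one to OnComment.
// String literals are ignored for brevity.
inline void ScanComments(const std::string &Source,
                         const CommentCallback &OnComment) {
  assert(OnComment && "a callback is required");
  unsigned Line = 1;
  for (std::size_t I = 0; I < Source.size(); ++I) {
    if (Source[I] == '\n') {
      ++Line;
      continue;
    }
    if (Source[I] != '/' || I + 1 >= Source.size())
      continue;
    if (Source[I + 1] == '/') {
      // Line comment: runs up to, but not including, the newline.
      std::size_t End = Source.find('\n', I);
      if (End == std::string::npos)
        End = Source.size();
      OnComment(Source.substr(I, End - I), Line);
      I = End - 1; // the loop increment lands on the newline (or the end)
    } else if (Source[I + 1] == '*') {
      // Block comment: runs through the closing "*/" (or to end of input).
      std::size_t End = Source.find("*/", I + 2);
      End = (End == std::string::npos) ? Source.size() : End + 2;
      const std::string Text = Source.substr(I, End - I);
      const unsigned StartLine = Line;
      for (char C : Text)
        if (C == '\n')
          ++Line;
      OnComment(Text, StartLine);
      I = End - 1;
    }
  }
}

struct CollectedComment {
  std::string Text;
  unsigned Line;
};

// One possible policy: record every comment for a later pass. A
// doxygen-like tool or an annotation checker would plug in its own callback.
inline std::vector<CollectedComment> CollectComments(const std::string &Source) {
  std::vector<CollectedComment> Out;
  ScanComments(Source, [&Out](const std::string &Text, unsigned Line) {
    Out.push_back({Text, Line});
  });
  return Out;
}
```

The scanner knows nothing about declarations; deciding what a comment binds to stays entirely in the callback, which is the separation of policy from mechanism described above.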

Another way would be to add some sort of filter between the lexer and the parser which would process and delete the comment tokens as they come, but it would probably be slower and on the critical path (not sure the lexing/parsing part is time critical, since the semantic analysis will eventually probably be a lot slower).

I'm not certain if I completely understand this solution. At the end of the day you still need to bind comments (or whatever data you extract from them) to ASTs (decls, etc.). Since the parser/lexer has no notion of ASTs, you almost necessarily have to put some of the key logic at a higher level (e.g., the Sema module). IIRC, essentially the parser and lexer just build tokens and process the C grammar; Sema actually builds the ASTs based on an interface between it and the parser.

I started coding and have a few features working, but I am hesitating
between two possible implementations.

What I started doing is:

- Add members and accessor functions to CodeGenModule and CodeGenFunction.
- Insert code directly in the code-generating functions of these classes,
conditionally activated on a flag:

No one is working on extending clang to produce debug info. Note that -emit-llvm in general is still pretty early on. It is sufficient to run a bunch of small programs, but still has a number of limitations. We'll probably focus on getting -emit-llvm to 100% completeness and correctness before tackling debug info.

If you wanted to tackle debug info, I think it would be a great project! It is also nicely parallelizable. I'd get a copy of llvm-gcc and see the debug info it emits at -O0 -g for reference.

I started coding and have a few features working, but I am hesitating
between two possible implementations.

What I started doing is:

- Add members and accessor functions to CodeGenModule and CodeGenFunction.
- Insert code directly in the code-generating functions of these classes,
conditionally activated on a flag:

I'd suggest adding a new class, e.g. CodeGenDebugInfo, that holds the debug-related information. This is the model that llvm-gcc uses in its llvm-convert.cpp and llvm-debug.cpp files. With such a class in place, you can then do stuff like:

void CodeGenFunction::GenerateCode(const FunctionDecl *FD) {
[...]

  // Create subprogram descriptor.
  if (DebugInfo)
    DebugInfo->CreateSubProgramDesc(...);

The nice thing about this model is that it keeps the debug info emission code localized in one file, while making the hooks into it obvious.
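As a self-contained illustration of that model, here is a rough sketch in which the debug info object is optional and the code generator calls into it only when it is present. DebugInfoSketch and FunctionEmitterSketch are hypothetical stand-ins for the suggested CodeGenDebugInfo class and for CodeGenFunction; they only record names rather than emitting any LLVM debug intrinsics.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the suggested CodeGenDebugInfo class. It only
// records subprogram names; a real implementation would build descriptors
// and emit LLVM debug intrinsics.
class DebugInfoSketch {
  std::vector<std::string> Subprograms;

public:
  void CreateSubProgramDesc(const std::string &FunctionName) {
    Subprograms.push_back(FunctionName);
  }
  const std::vector<std::string> &subprograms() const { return Subprograms; }
};

// Hypothetical stand-in for CodeGenFunction. DebugInfo is null when debug
// info was not requested (e.g. no -g flag), so the hook costs one branch.
class FunctionEmitterSketch {
  DebugInfoSketch *DebugInfo;
  std::vector<std::string> Emitted;

public:
  explicit FunctionEmitterSketch(DebugInfoSketch *DI) : DebugInfo(DI) {}

  void GenerateCode(const std::string &FunctionName) {
    assert(!FunctionName.empty() && "function needs a name");
    // Create subprogram descriptor, only if debug info is enabled.
    if (DebugInfo)
      DebugInfo->CreateSubProgramDesc(FunctionName);
    // Placeholder for the ordinary code generation work.
    Emitted.push_back(FunctionName);
  }
  const std::vector<std::string> &emitted() const { return Emitted; }
};
```

Passing a null pointer turns every hook into a no-op, so the regular code generation path is unchanged when debug info is off.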

Sorry for the long long mail...

No problem! I'm glad you're interested in this area.

PS2: Should the first letter of a method name be upper or lower case:
setA(); or SetA();
getA();
isA();

There isn't a standard naming scheme :-/ , I'd just follow the example of the code that it interacts with.

PS3: If my English is too bad, say so and I will try to do better. The
thing is I lack practice and I have never been good at grammar (even in my
native language).

Your English is excellent!

-Chris

I think that Ted covered all the important points with this in his previous email. The meta issue is that it is tricky. For example, it is reasonably easy to buffer up comments and have the next decl that occurs apply the buffer to the decl. This covers stuff like this:

/// foo
int X;

However, sometimes you want to apply it to the previous decl as well:

int X; /// fooo!!

I think it would be great to have a nice (but optional) way to handle this. I think that buffering up comments and having the AST builder code suck them out when needed is the right way to go...
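A rough sketch of that buffering scheme, covering both the leading and the trailing case, might look like the following. CommentBuffer and its methods are hypothetical names, not existing clang classes, and the sketch assumes that comments and decls are reported in source order, with a trailing comment arriving after the decl on the same line.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Comment {
  std::string Text;
  unsigned Line;
};

// Hypothetical buffer sitting between the lexer and the AST builder.
class CommentBuffer {
  std::vector<Comment> Pending; // comments seen since the last decl
  std::map<std::string, std::vector<std::string>> Docs; // decl name -> comments
  std::string LastDecl;
  unsigned LastDeclLine = 0;

public:
  // Called for each comment, in source order.
  void AddComment(const std::string &Text, unsigned Line) {
    // A comment on the same line as the previous decl documents that decl;
    // anything else waits for the next decl.
    if (!LastDecl.empty() && Line == LastDeclLine)
      Docs[LastDecl].push_back(Text);
    else
      Pending.push_back({Text, Line});
  }

  // Called by the AST builder once a decl has been built: it takes all
  // buffered comments and remembers the decl for trailing comments.
  void AttachToDecl(const std::string &Name, unsigned Line) {
    assert(!Name.empty() && "decl needs a name in this sketch");
    for (const Comment &C : Pending)
      Docs[Name].push_back(C.Text);
    Pending.clear();
    LastDecl = Name;
    LastDeclLine = Line;
  }

  // Returns the comments bound to Name, or an empty list if there are none.
  std::vector<std::string> CommentsFor(const std::string &Name) const {
    auto It = Docs.find(Name);
    return It == Docs.end() ? std::vector<std::string>() : It->second;
  }
};
```

The same-line rule is only one possible policy; a tool that wants a different binding could replace AddComment without touching the code that builds the decls.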

-Chris