Auto-generation of ASTMatchers predicates from source code, proof-of-concept

Hello,

I have played today with ASTMatchers/Refactoring - it is a great tool, thank you guys!

I am thinking about a tool that would automatically generate the strictest possible (or at least a very strict) matcher predicate for a given piece of code. That would help new users of ASTMatchers "dive into" the predicate syntax.
It would also be helpful for getting qualified names right, etc.

One possible usage pattern is:
1. Prepare several pieces of code of interest and place them into real code in order to get the right context. Highlight these snippets somehow, maybe with a special macro.
2. Run the tool over these units and get strict predicates as the result.
3. Adjust the predicates, combine them, relax conditions, etc., and finally use them.

Small proof-of-concept:
tool: http://ideone.com/XetAL (based on remove-cstr-calls, same build options)
test_case: http://ideone.com/GXEcw
tool output: http://ideone.com/xEgFJ

What do you think about such a concept?

Best Regards,
Evgeny


Hi Evgeny,

I love the idea! In general, I think it’s a really hard problem (which is the main reason we haven’t tackled this yet ourselves). I guess it’s a little more researchy, but I’d be super interested in seeing more work going into making C++ tooling easier and more powerful …

I don’t know whether you’re aware, but in the tooling branch there’s also a proof-of-concept implementation for dynamic matcher generation, and it might make sense to base your stuff on that.

Cheers,
/Manuel

17.06.2012 19:53, Manuel Klimek wrote:

> I love the idea! In general, I think it’s a really hard problem (which is the main reason we haven’t tackled this yet ourselves). I guess it’s a little more researchy, but I’d be super interested in seeing more work going into making C++ tooling easier and more powerful …

I am not sure how hard it would be in general; I have seen only a small part of Clang’s AST/ASTMatchers.
But in any case, it is possible to restrict the tool to a small subset of expressions and still keep it useful.

I see different approaches to implementing such a tool:

  1. The naive and straightforward way: an external tool on top of ASTMatchers, using the same method as in the example above. This approach duplicates some “knowledge”, i.e. every predicate needs additional code in the predicate-generation tool.
  2. A more complex way that reuses “knowledge” from ASTMatchers. This approach would require re-describing the predicates in some higher-level abstraction, which would let us generate both the predicates and the predicate-generation code from a single source of “knowledge”. I am not sure this approach works in general; it requires some research.

> I don’t know whether you’re aware, but in the tooling branch there’s also a proof-of-concept implementation for dynamic matcher generation, and it might make sense to base your stuff on that.

Are you talking about the dynamic parseMatcher (as in the ast-query example)?
Or maybe about some interactive (maybe GUI) tool for building predicates? I remember that Chandler mentioned something similar at http://www.youtube.com/watch?v=yuIOGfcOH0k&t=27m56s

Best Regards,
Evgeny


> I am not sure how hard it would be in general; I have seen only a small part of Clang’s AST/ASTMatchers.
> But in any case, it is possible to restrict the tool to a small subset of expressions and still keep it useful.

> I see different approaches to implementing such a tool:
>
>   1. The naive and straightforward way: an external tool on top of ASTMatchers, using the same method as in the example above. This approach duplicates some “knowledge”, i.e. every predicate needs additional code in the predicate-generation tool.
>   2. A more complex way that reuses “knowledge” from ASTMatchers. This approach would require re-describing the predicates in some higher-level abstraction, which would let us generate both the predicates and the predicate-generation code from a single source of “knowledge”. I am not sure this approach works in general; it requires some research.

My hunch is that for (2) the matchers would be sufficiently different from what they are now that it would end up like (1) anyway - insert the obvious disclaimer that I might be wrong etc. here :)
At least for a first step I think (1) is the way to go - once we have more experience with how the stuff is used, we can then try to figure out how to generalize it…


> Are you talking about the dynamic parseMatcher (as in the ast-query example)?

Yep, that one…

> Or maybe about some interactive (maybe GUI) tool for building predicates? I remember that Chandler mentioned something similar at http://www.youtube.com/watch?v=yuIOGfcOH0k&t=27m56s

Now we’re talking the next step :) Yeah, having a GUI would be great (and just so we’re clear: by GUI I mean a web page :P)

Cheers,
/Manuel

17.06.2012 22:44, Manuel Klimek wrote:


> My hunch is that for (2) the matchers would be sufficiently different from what they are now that it would end up like (1) anyway - insert the obvious disclaimer that I might be wrong etc. here :)

Yes, (2) would require re-implementing the matchers in other terms. But I think the matcher API would not change.

> At least for a first step I think (1) is the way to go - once we have more experience with how the stuff is used, we can then try to figure out how to generalize it…

Yes, I fully agree. (1) is a good starting point, at least for a prototype.


> > Are you talking about the dynamic parseMatcher (as in the ast-query example)?

> Yep, that one…

Thank you for the note.

> > Or maybe about some interactive (maybe GUI) tool for building predicates? I remember that Chandler mentioned something similar at http://www.youtube.com/watch?v=yuIOGfcOH0k&t=27m56s

> Now we’re talking the next step :) Yeah, having a GUI would be great (and just so we’re clear: by GUI I mean a web page :P)

And maybe an AST database optimized for fast predicate matches :)

Best Regards,
Evgeny


> And maybe an AST database optimized for fast predicate matches :)

For small projects this might be interesting - for us the question is how that would scale. We’ve found parsing the C++ code to actually be an interesting way to scale the AST, for the small price of needing up to 3-4 seconds per TU (on average). Denormalizing the AST itself produces a huge amount of data, and denormalizing even more seems like a non-starter.

Thoughts?
/Manuel

18.06.2012 0:05, Manuel Klimek wrote:


> For small projects this might be interesting - for us the question is how that would scale. We’ve found parsing the C++ code to actually be an interesting way to scale the AST, for the small price of needing up to 3-4 seconds per TU (on average). Denormalizing the AST itself produces a huge amount of data, and denormalizing even more seems like a non-starter.

> Thoughts?

It depends on how much you would like to scale. And yes, it also depends on project size.
For instance, if the required scaling is one task per TU, that is one case.
If the required scaling is many tasks (tens, hundreds) per TU, that is another story.

Best Regards,
Evgeny


> It depends on how much you would like to scale. And yes, it also depends on project size.
> For instance, if the required scaling is one task per TU, that is one case.

Perhaps I need to expand on what I mean here:
Imagine you have on the order of 100 MLOC.
If you want an “AST database” for predicate matches, the question is what indexes you create. If you basically want an extra index per “matcher”, the denormalization takes too much data. If you don’t create an index per matcher, how do you evaluate matchers efficiently?

Thoughts?
/Manuel

19.06.2012 16:35, Manuel Klimek wrote:


> Perhaps I need to expand on what I mean here:
> Imagine you have on the order of 100 MLOC.
> If you want an “AST database” for predicate matches, the question is what indexes you create. If you basically want an extra index per “matcher”, the denormalization takes too much data. If you don’t create an index per matcher, how do you evaluate matchers efficiently?

I understood that part of your previous message.
My point was that if you have 1k translation units and need to scale up to 100k parallel tasks, then it is obvious that “task per TU” is not sufficient, and another approach is needed (maybe pre-parse and split the AST).

Best Regards,
Evgeny


> I understood that part of your previous message.
> My point was that if you have 1k translation units and need to scale up to 100k parallel tasks, then it is obvious that “task per TU” is not sufficient, and another approach is needed (maybe pre-parse and split the AST).

I don’t understand the point you’re trying to make here yet :)
Are you talking about having the same (parametrized) task done 100k times in parallel (like: find all references to X, run by many engineers), or something else? How would a pre-parsed AST help? Perhaps you can expand on the “obvious” part ;)

Cheers,
/Manuel

19.06.2012 16:57, Manuel Klimek wrote: