C++ Scoring Tool

Hello,

For my undergraduate senior project, my team and I are developing a tool that evaluates the formatting and code conventions of a C++ program, outputs a score, and displays useful messages, much like pylint does for Python.

The idea is similar to clang-format, except that the tool makes no alterations to the code. It would be used as a teaching aid and automatic grader. To handle the beautiful diversity of C++, it shouldn’t constrain the author to any particular style (although it should be able to do that too). Take opening curly braces on the same line as the function declaration versus on a new line: here the tool could check for consistency only. As long as the entire file uses the same format, you get a perfect score. If, however, there are 10 braces on the same line and 9 on a new line, the penalty to the score is larger than if there are 18 on the same line and 1 on a new line. The idea is to enforce consistency without getting in the way of the author’s preferred style. This should give professors a robust tool for teaching C++.
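As a rough illustration of the scoring idea above, here is a minimal sketch. The post does not specify a formula, so the majority-share measure and the name `consistencyScore` are assumptions made for this example only.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of any existing tool.
// counts[i] is the number of times style i occurs in the file
// (for example: braces on the same line, braces on a new line).
// Returns the share of occurrences that use the most common style,
// so 1.0 means the file is fully consistent.
double consistencyScore(const std::vector<std::size_t>& counts) {
    std::size_t total = 0;
    std::size_t majority = 0;
    for (std::size_t c : counts) {
        total += c;
        majority = std::max(majority, c);
    }
    if (total == 0) {
        return 1.0;  // nothing to judge, so nothing to penalize
    }
    return static_cast<double>(majority) / static_cast<double>(total);
}
```

With this measure, a 10/9 split scores 10/19 and an 18/1 split scores 18/19, so the nearly even mix is penalized more heavily, as described. Other measures (entropy, for instance) would satisfy the same ordering.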

I was hoping the Clang community could help me understand the inner workings of Clang a little better. Right now, my hangup is getting format data to work in conjunction with Clang’s AST. What I’m trying to do is get back the whitespace, comment, and bracket information that is lost while the AST is built. Suppose I want to check that all operators have a consistent spacing format, something like “(2 * 2)” versus “(2*2)” versus “(2* 2)”. The AST will be used to get the semantics of that particular operator so that it is not confused with the array pointer operator, but I also need to count the whitespace before and after the operator. The same concept will be applied to the whitespace surrounding statements. If done right, I should be able to treat all operators the same way no matter how complex the expression is. Something like “(x - 4) / 3 * (2 +1)” would show an inconsistency in the last part, “(2 +1)”, because of a missing space.

My first thought was to use the SourceManager’s location information to point back into the source code, then process and identify the whitespace from there; however, this seems wildly inefficient and inelegant. My second thought was to somehow get Clang to keep the whitespace information and add it to the AST, but I believe there are inherent difficulties with that.

My biggest problem is my lack of expertise with Clang’s source code. Does anybody have ideas on how I can get Clang to give me the information I need to support the functionality above?

Thanks for any interest. I hope this is an appropriate mailing list to post my question.

Daniel.

+cfe-dev

The people familiar with clang-format are more likely to be active there :)

My first thought was to use the SourceManager locational information to point back to the source code, then process and identify the whitespace from there; However, this seems wildly inefficient and inelegant.

This is exactly how you would do that.

That said, why not teach students to use tools to do work for them instead of spending time doing it on their own?

My first thought was to use the SourceManager locational
information to point back to the source code, then process and
identify the whitespace from there; However, this seems wildly
inefficient and inelegant.

This is exactly how you would do that.

Excellent. I pursued this direction a couple of days ago and was surprised by the richness of the SourceManager. Using the RecursiveASTVisitor tutorial in the Clang 6.0 documentation, I was able to visit every BinaryOperator and grab the location of the actual operator symbol, rather than the location of the operator expression as I had initially expected. Then, just as a proof of concept, I scanned the char pointer backward and forward from that location to collect the whitespace.
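The backward and forward scan could look roughly like the sketch below. It is not the code from the project: `OperatorSpacing` and `measureSpacing` are hypothetical names, and the function takes a plain string and offset so that it runs without Clang. In the real tool the starting point would presumably come from `BinaryOperator::getOperatorLoc()` and `SourceManager::getCharacterData()`.

```cpp
#include <cstddef>
#include <string>

// Number of blanks immediately before and after one operator token.
struct OperatorSpacing {
    std::size_t before;
    std::size_t after;
};

// Hypothetical helper: opOffset and opLength locate the operator inside
// source. Here they are passed in by hand; the AST would supply them.
OperatorSpacing measureSpacing(const std::string& source,
                               std::size_t opOffset,
                               std::size_t opLength) {
    auto isBlank = [](char c) { return c == ' ' || c == '\t'; };

    // Walk backward from the character just before the operator.
    std::size_t before = 0;
    while (before < opOffset && isBlank(source[opOffset - 1 - before])) {
        ++before;
    }

    // Walk forward from the character just after the operator.
    std::size_t after = 0;
    const std::size_t end = opOffset + opLength;
    while (end + after < source.size() && isBlank(source[end + after])) {
        ++after;
    }

    return {before, after};
}
```

For the expression from the original post, `measureSpacing("(x - 4) / 3 * (2 +1)", 17, 1)` reports one blank before the `+` and none after it, while the `-` at offset 3 has one on each side, which is the inconsistency the tool is meant to flag.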

My next thought is to use LexicallyOrderedRecursiveASTVisitor to get the distance between the current BinaryOperator and the previous token, which would save me from having to parse the string to find the whitespace.

That said, why not teach students to use tools to do work for them
instead of spending time doing it on their own?

There are two reasons.

First, it is similar to why arithmetic is taught to students first, even though calculators are easier and faster: supposedly it helps build a good foundation. The university I'm attending tends to teach students the manual way before introducing useful tools.

Second, since this would be a tool for scoring C++ code, teachers can incorporate it into an automatic grading script, whether or not students formatted their code manually. If they couldn't be bothered to get their source in order, they would lose points on the assignment.