Source code rewriting for invalid AST nodes

Hello,

I’m using clang to rewrite (generated) source code. Usually that can be done by running matchers on the ast in combination with a rewriter instance.

Unfortunately this fails in the case of operator[] where the index argument cannot be converted to a valid type (e.g. class X{} x; int * array; int i = array[x]; ). The corresponding ast nodes are missing since the “array[x]” statement has no valid representation.

How can I detect and rewrite the code in cases like above example? “array[x]” → “someFunc(array,x)”
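
To make the intended target of the rewrite concrete, here is a minimal sketch of what the replacement function could look like. The layout of X (an int wrapper with a `value()` accessor) is an assumption for illustration only; the real class and the simulation-specific work inside `someFunc` will differ.

```cpp
#include <cassert>

// Hypothetical stand-in for the wrapper type X; the real class is more involved.
class X {
public:
    explicit X(int value) : value_(value) {}
    // Explicit accessor instead of operator int(), so no implicit conversions.
    int value() const { return value_; }
private:
    int value_;
};

// Free function that takes the place of the built-in subscript `array[x]`.
template <typename T>
T &someFunc(T *array, const X &index) {
    assert(array != nullptr);
    // Any additional simulation-specific operations would go here.
    return array[index.value()];
}
```

With this in place, `int i = array[x];` would become `int i = someFunc(array, x);`, and because the function returns a reference it can also appear on the left-hand side of an assignment.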

ExternalSemaSource::CorrectTypo doesn’t get called in this case (maybe because all tokens are valid?) so that attempt failed.

Using ExternalSemaSource::LookupUnqualified also failed so far, because I haven’t found a method to get a valid SourceRange for that code part. It would also require manual parsing of that code part, which seems like a “dirty” solution to me.

The only idea that I have left is to declare builtin operators for any type with Sema::AddBuiltinCandidate, but that may result in many operator definitions. Also I have no idea how to iterate over all types and how to declare these functions. However this may be the best solution, because then matchers can be used to find and rewrite those code parts.

I would appreciate any help to find a solution.

Regards,

Marc

P.S. I am aware that adding operator int() to class X of above example would allow those statements but that is not an option, since X cannot be represented as int in my case and additional operations need to be performed; such an operator may also mess up other parts of the code.

The general advice is to only do code transformations on valid code. I don’t know enough about your problem to understand why that is not possible.

In general I would also say that doing code transformation should only be done on valid code since one needs to know what actually happens. This particular problem is unfortunately rather specific. Sorry if the given example was not sufficient.

Maybe the problem becomes clearer when class X is defined as wrapper for int with special functionality.

The code that I try to transform is part of a hardware simulation written/generated in C++. The code is guaranteed to be valid except for the partial substitution of int variables by X variables. The resulting statements are only invalid in the sense that a parameter type is not right. Due to the complexity of the code it is not feasible to predict where type X and where int is used for such operations. Since some operators (e.g. [] for pointers) cannot be overloaded in standard C++, errors will come up. For that and other reasons it is necessary to rewrite the code to use custom functions instead of the operator itself. This is where my problem with clang lies. Those nodes are removed from the AST due to the missing operator[]/missing type conversion. But those are the nodes I need to preserve in order to run a matcher and transform the code as needed for the simulation. Again, as noted before, adding operator int() to class X is not a solution since that would create many ambiguity problems.

So to boil the problem down further: How can I force clang to ignore wrong types that are passed to operators/functions and build the AST with such nodes?

Again, I suspect that adding built-in operators for those cases is the way to go, but I don’t know how to iterate over all types and then create empty dummy functions for that.

I hope this describes my problem sufficiently.

I still don’t fully understand what the current situation is.
So you have code that calls array[x] with a class type x? How did you produce that code?

In this particular example “array[x]” was generated externally while I changed the declaration “int x;” to “X x;”. That is the point where I need to patch the generated code in order to fix the missing operator[] error for type X and allow other simulation-relevant operations.

I assume you cannot change the code generator?

Why can’t you:

  1. generate the code; parse it with the current version (having the ‘int x’)
  2. find all ‘int x’s you want to change to ‘X x;’; also find all uses of them (including uses in array[x]); output all this information in some format
  3. run over all those cases; now you can change ‘int x’ to ‘X x’ and ‘array[x]’ to ‘myfunc(array, x)’ at the same time
  4. reap benefits; codebase is never in a non-parsing state
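
The "record first, then change everything at once" idea in steps 2 and 3 can be sketched without any clang dependency. The `Edit` struct and `applyEdits` function below are hypothetical helpers, not clang API; in a real tool the offsets would come from source locations found while analyzing the still-valid code, and clang's own rewriting facilities would play the role of `applyEdits`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One recorded edit: replace `length` characters starting at `offset` with `text`.
// Offsets always refer to the ORIGINAL, still-parsing source text.
struct Edit {
    std::size_t offset;
    std::size_t length;
    std::string text;
};

// Apply all recorded edits in one pass. Edits are processed from the highest
// offset to the lowest, so applying one edit never shifts the position of an
// edit that is still pending. Assumes the edits do not overlap.
std::string applyEdits(std::string source, std::vector<Edit> edits) {
    std::sort(edits.begin(), edits.end(),
              [](const Edit &a, const Edit &b) { return a.offset > b.offset; });
    for (const Edit &e : edits) {
        assert(e.offset + e.length <= source.size());
        source.replace(e.offset, e.length, e.text);
    }
    return source;
}
```

For the example from this thread, one edit would replace the `int` of the declaration with `X` and a second would replace `array[x]` with `myfunc(array, x)`; both are computed against the unchanged text and applied together, so no intermediate state has to parse.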

Cheers,
/Manuel

First of all thanks for your patience and help with my problem.

You are right, I cannot change the code generator (and it may possibly even be exchanged with another generator, but that is a future problem).

Your solution seems right to me in the simple case of the example.
But I think it fails for me due to the complexity of the generated code for 2 reasons:

  1. The generated code uses macro defines to declare variables. I only change some defines but I don’t know exactly which variables are affected. (Still, this may be solvable by changing those defines, recording the variables, changing them back and recording the uses, or by recording macro expansion operations for those cases.)

  2. There may be the case that x is not used directly but together with operations (or worse, a function call). The resulting type in this case would require that I deduce the type of the operator[] parameter for a statement under the condition of a changed type of x.

I think I will give that solution a try, but it seems complex to implement and very limited in its ability to accept variations in the generated code.

My hope is still that there is a “keep the node in the ast” solution for clang that would allow me to rewrite such a statement based on the parameter type without caring about the exact statement within the brackets.

Greetings,

Marc

  1. The generated code uses macro defines to declare variables. I only change some defines but I don’t know exactly which variables are affected. (Still, this may be solvable by changing those defines, recording the variables, changing them back and recording the uses, or by recording macro expansion operations for those cases.)

If you know the locations where you change the defines, you can figure out which variables are affected. Clang has all the information about macro expansion stored in the AST / the source-manager.

  2. There may be the case that x is not used directly but together with operations (or worse, a function call). The resulting type in this case would require that I deduce the type of the operator[] parameter for a statement under the condition of a changed type of x.

Yep. I’d first look at how many of those there are; often, even in > 100MLOC code bases, there are only a handful of corner cases.

I think I will give that solution a try, but it seems complex to implement and very limited in its ability to accept variations in the generated code.

It seems to me that it allows you to be much more specific in addressing what you need.

My hope is still that there is a “keep the node in the ast” solution for clang that would allow me to rewrite such a statement based on the parameter type without caring about the exact statement within the brackets.

But then the question is how can you be sure that it works afterwards? For that it seems like you’d need to know the exact information you would get when analyzing the unchanged code first (which is why I suggested that).

  2. There may be the case that x is not used directly but together with operations (or worse, a function call). The resulting type in this case would require that I deduce the type of the operator[] parameter for a statement under the condition of a changed type of x.

Yep. I’d first look at how many of those there are; often, even in > 100MLOC code bases, there are only a handful of corner cases.

You are right, after looking through my generated code it doesn’t seem to be a problem.

My hope is still that there is a “keep the node in the ast” solution for clang that would allow me to rewrite such a statement based on the parameter type without caring about the exact statement within the brackets.

But then the question is how can you be sure that it works afterwards? For that it seems like you’d need to know the exact information you would get when analyzing the unchanged code first (which is why I suggested that).

That is why I thought that I could force clang to keep those nodes by declaring empty dummy built-in operator overloads for any pointer type, which I could later replace with the help of the AST. That way I thought I could be sure that the source code transformations are valid. Basically the idea was to use internal clang methods to define something along the lines of:

template <typename T>
T & operator[](T * array, const X & index){…}

which unfortunately is not valid C++, since operator[] can only be declared as a member function. Then again, I lack the knowledge about the inner workings of clang to do that, and the idea may be far more complicated or impractical than I thought.

Again, thanks for your help. Your solution should be more practical than trying to handle every theoretical case in a general manner, as I first intended.

Regards,

Marc Greim