TreeTransform and Clang

Hello,
  I need to transform the generated AST just before it is passed to
Sema. I know that I have to use (inherit from) the TreeTransform class, but
I am having trouble finding where to put the new transformation (I want
to preserve the architecture of LLVM). Where can I call the new
transformer so that it can transform the tree before
semantic analysis?
Best regards,
V. Vassilev

I think you're confused; semantic analysis is what builds the AST. We don't
build separate parse trees and semantic trees.

John.

Hello,
  Hm, is the error handling in the semantic analysis, too?
  I need the following.
  For example (I am skipping the pointers and other ugly things that
would make it less readable):

  function F(int x, float y, string z) {
     int i = objX.GetValue(x); // compile-time object and method invocation
     TFile t = new TFile("MyFile.hhh"); // here goes the interpreter, but it needs addresses of the variables from the compiler
     i = t.DoSomething(i, x, y, z);
     Console.WriteLine(i.ToString());
  }
  
  Here the semantic analyzer should report an undeclared variable and so on.
  Before that happens I want to change the AST and convert it into something like:

  function F(int x, float y, string z) {
     int i = objX->GetValue(x); //compile time object and method invocation
     TFile t = new TFile("MyFile.hhh"); //here goes the interpreter,but it needs addresses of the variables from the compiler
     Context c = new Context();
     c.AddVariable(i.GetDeclaration()) ; //here we should use some kind of reflection or we can insert just the variable address and the proper mapping.
     c.AddArgument(x.GetDeclaration());
     ...
     Interpreter.Interpret("t->DoSomething(i,x,y,z)", c);
     i = t.DoSomething(i, x, y, z);
     Console.WriteLine(i.ToString());
}

  I know I can modify Sema and/or put something in the symbol table
to achieve the goal, but I am looking for a more elegant solution. I know
that I can probably make a second pass over the invalid AST, but that is not
the most efficient way of solving the problem.
  Do you see a better solution? (I decided to ask here because I am not
very familiar with the LLVM infrastructure.)
Regards,
V. Vassilev

> Hello,
> Hm, Is the error handling in the semantic analysis, too?

Yes.

> I need the following
> For example (I am skipping the pointers and other ugly things that
> made it less readable):
>
> function F(int x, float y,string z) {
>     int i = objX.GetValue(x); //compile time object and method invocation
>     TFile t = new TFile("MyFile.hhh"); //here goes the interpreter,but it needs addresses of the variables from the compiler
>     i = t.DoSomething(i, x, y, z);
>     Console.WriteLine(i.ToString());
> }
>
> Here the semantic analyzer should say undeclared variable and so on.

Yes, it will.

> Before that happens I want to change the AST and to convert it in something like:
>
> function F(int x, float y,string z) {
>     int i = objX->GetValue(x); //compile time object and method invocation
>     TFile t = new TFile("MyFile.hhh"); //here goes the interpreter,but it needs addresses of the variables from the compiler
>     Context c = new Context();
>     c.AddVariable(i.GetDeclaration()) ; //here we should use some kind of reflection or we can insert just the variable address and the proper mapping.
>     c.AddArgument(x.GetDeclaration());
>     ...
>     Interpreter.Interpret("t->DoSomething(i,x,y,z)", c);
>     i = t.DoSomething(i, x, y, z);
>     Console.WriteLine(i.ToString());
> }
>
> I know I can modify the Sema and/or put something in the symbol table
> to achieve the goal but I am looking for more elegant solution.

The right way to do this would be to augment Sema's name-lookup facilities, so that your application gets a chance to provide symbols when no other symbols of the same name work. One way to do this is to create your own ExternalASTSource, and override FindExternalVisibleDeclsByName to add those symbols. If I remember correctly, LLDB does this.
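
To make the fallback idea concrete, here is a minimal, self-contained sketch of a lookup that consults an application-supplied provider only when no scope declares the name. It is a toy model: Symbol, ScopeChain and ExternalProvider are hypothetical names invented for this illustration, none of them are Clang classes, and the sketch makes no claim about when Clang actually queries an ExternalASTSource.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy model only: these are NOT Clang types.
struct Symbol {
    std::string name;
    std::string type;
};

// Hypothetical stand-in for an application-provided symbol source.
using ExternalProvider =
    std::function<std::optional<Symbol>(const std::string&)>;

class ScopeChain {
public:
    explicit ScopeChain(ExternalProvider provider)
        : provider_(std::move(provider)) {
        scopes_.emplace_back();  // the outermost (global) scope
    }

    void push() { scopes_.emplace_back(); }

    void pop() {
        assert(scopes_.size() > 1 && "cannot pop the global scope");
        scopes_.pop_back();
    }

    void declare(const Symbol& s) { scopes_.back()[s.name] = s; }

    // Search from the innermost scope outwards; ask the external
    // provider only when no scope declares the name.
    std::optional<Symbol> lookup(const std::string& name) const {
        for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end()) return found->second;
        }
        if (provider_) return provider_(name);
        return std::nullopt;
    }

private:
    std::vector<std::map<std::string, Symbol>> scopes_;
    ExternalProvider provider_;
};
```

With a provider that knows only the name "t", a lookup of "t" succeeds through the fallback, while a local declaration of "t" in an inner scope still takes precedence over it.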

> I know that probably I can make second pass of the invalid AST but it is not
> the most efficient way of solving the problem.

You won't be able to make a second pass over the invalid AST; there isn't enough information left to produce a reasonable AST.

  - Doug

I understand. But I only want to replace one node of the AST with
another. I don't need to exchange external symbol names and so on (at
least for now). Is it possible to hook in somewhere before the error
handling and substitute the ExpressionStatement (i = t.DoSomething(i, x,
y, z);) with a new BlockStatement or several ExpressionStatements? Or is
the AST not built in case of an error (which seems strange to me, because it
would not be flexible enough for complex transformations)?

You might be able to hack Sema to do this, but I don't see any clean way to do it.

> Or in case of error AST is not built (which is strange for me because it
> wouldn't be flexible for complex transformations...).

Complex transformations generally can't be performed correctly unless you start with correct inputs. Otherwise, how would you know that you've generated the correct output?

  - Doug

> You might be able to hack Sema to do this, but I don't see any clean way to do it.

So I guess I have to find where it handles the unknown symbol references
and then extend the logic, right?

> Or in
> case of error AST is not built (which is strange for me because it
> wouldn't be flexible for complex transformations...).

> Complex transformations generally can't be performed correctly unless you start with correct inputs. Otherwise, how would you know that you've generated the correct output?

Yes, you are right, but I don't see why the parser does not generate a parse
tree that is correct according to the grammar and then pass it on for
semantic validation.
I think that would be the easiest way to extend the semantics in custom
LLVM modifications such as mine.
For example, the project I am working on needs advanced context
information (if there is a statement like "TFile f = new
TFile(OpenDialog.Execute(Opendialog.Filename));" then it is possible to
have someobj.SomeMethod(SomeParams)).
If I had a parse tree I could transform it easily, because I know the
exact semantics of what I am trying to achieve. In effect I would make one
semantic validation pass before the actual one. I need the parse tree,
and after that I will describe what its semantics are.
In my opinion, if we had a distinction between these two phases, syntax
and semantic analysis, we could improve the user-side semantics:
Syntax -> AST; CustomTransform1 -> AST; CustomTransform1 -> AST;
CustomTransform1 -> AST; Semantics -> AST;
The main advantage would be that if the result does not pass the actual
semantic validation it will fail, and the code generator will stay
the same.
I don't know if my idea is clear enough, but if you are interested I can
describe it in more detail.

Because it's expensive to create, we haven't found it to be useful, and it's pretty
much hopeless anyway. C does not have an unambiguous context-free
grammar; the C++ grammar is the same, only much more so.
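
A well-known instance of this ambiguity is a statement of the shape "T * x;", which is a pointer declaration if T names a type and a multiplication expression otherwise. The sketch below is only an illustration of that dependency: the function classifyStarStatement and the enum StatementKind are hypothetical names, and a real parser consults far more than a set of type names.

```cpp
#include <set>
#include <string>

// How a statement of the shape "A * b;" must be read.
enum class StatementKind { Declaration, Expression };

// The token sequence alone cannot decide between the two readings;
// the answer depends on whether the left-hand name currently denotes
// a type, i.e. on information that comes from semantic analysis.
StatementKind classifyStarStatement(const std::set<std::string>& typeNames,
                                    const std::string& leftName) {
    return typeNames.count(leftName) != 0 ? StatementKind::Declaration
                                          : StatementKind::Expression;
}
```

The same tokens are classified differently depending on the set of known type names, which is why a purely syntactic first pass cannot build a definite tree for them.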

John.

I am thinking about two passes:
1st pass: I only put "TFile t = new TFile("MyFile.hhh");", which is
a valid construct and will serve as a marker for the next pass.
2nd pass: I use TreeTransform to extend the block with the
custom invocations of the interpreter.
I need some kind of preprocessing to achieve that, because in the
first pass I need to escape the symbols that are "unknown" to Sema, right?
After that I need to send them to the TreeTransform somehow. Can you give
me basic directions, or a better approach to achieve my goal?

Best regards,
V. Vasilev

I don't know what you plan to do with "unknown" symbols, but I can't think of any approach that is likely to work. The parser needs to know what every symbol is so that it can inform the semantic analysis module, which then builds the ASTs. An entity with "unknown" type won't pass semantic analysis and therefore we won't get an AST for it. There's not really any way around this problem, since C++ is so ridiculously context-sensitive.

Your best bet is to intercept name lookup and provide a proper, fully-formed symbol at the time when the parser needs it. Then, you'll get well-formed ASTs from well-formed code.

> After that I need somehow to send them to TreeTransformer. Can you give
> me basic directions or better approach to achieve my goal?

Perhaps we should step back a bit, because I don't think TreeTransform is what you need: what are you trying to accomplish with Clang?

  - Doug

Hello Doug,
  Sorry for the delayed answer, but I tried to dig a little into the
problem. In the end I am going to try what you suggested:

> The right way to do this would be to augment Sema's name-lookup facilities, so that your application gets a chance to provide symbols when no other symbols of the same name work. One way to do this is to create your own ExternalASTSource, and override FindExternalVisibleDeclsByName to add those symbols. If I remember correctly, LLDB does this.

If I have understood correctly, I need to create a new class in the
interpreter that is a new ExternalASTSource. Then Sema
will look for the symbol first in the innermost scope, then in the outer ones,
and finally it will ask the ExternalASTSource whether it
recognizes the symbol. If it does, then Sema will put what I want into
the AST, right?
Where do I have to register the new class so that Clang's Sema can
find it?

Best regards,
V. Vassilev

No, ExternalASTSource doesn't work that way.
For C, the ExternalASTSource (EASTS) is queried the moment the identifier is first encountered in a source file. It only works for global identifiers.
For C++, the moment you do name lookup in a given context, the context checks if it has names in the external source (a flag that currently can only be set by that external source, if the context itself originates there) and loads them all. For a function that is in the source, the EASTS will never be queried. For the global context, the source will be queried at the very first name lookup done.
We will change the C++ case soon, but probably not in a way that really helps you, given that you want the contents of the external source modified in the middle of a function.
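
The C++ behaviour described above can be pictured with a small toy model: a context that originates from an external source carries a flag, and its first lookup loads all of its external names at once, while a context written in the source never consults the external source. Context, NameTable and loadCount are hypothetical names for this sketch and are not Clang's API.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy model only: maps a declared name to a type name.
using NameTable = std::map<std::string, std::string>;

class Context {
public:
    // A context written in the source: it has no external storage.
    Context() = default;

    // A context that originates from an external source: the flag is
    // set at construction and the loader supplies all of its names.
    explicit Context(std::function<NameTable()> loader)
        : loader_(std::move(loader)), hasExternal_(true) {}

    void declare(const std::string& name, const std::string& type) {
        names_[name] = type;
    }

    // The first lookup in a flagged context loads every external name
    // in one go; later lookups only consult the local table.
    const std::string* lookup(const std::string& name) {
        if (hasExternal_) {
            for (const auto& entry : loader_()) names_.insert(entry);
            hasExternal_ = false;
            ++loads_;
        }
        auto it = names_.find(name);
        return it == names_.end() ? nullptr : &it->second;
    }

    int loadCount() const { return loads_; }

private:
    NameTable names_;
    std::function<NameTable()> loader_;
    bool hasExternal_ = false;
    int loads_ = 0;
};
```

In this model a failed lookup in a source-defined context simply returns nothing, which mirrors the point that a function body in the source gives the external source no chance to supply a missing name.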

Sebastian