AST transformation for P2066

I’m working on the frontend implementation of P1875 (wording changes in P2066), and have some beginner questions about doing code transformations in Clang’s AST. Apologies in advance if this topic should be in #beginners instead.

My goal for now is transforming an atomic do block from something like this

atomic do {
  statements;
}

into something like this

{
  transactionStart();
  statements;
  transactionEnd();
}

where transactionStart() and transactionEnd() are function calls that mark the start and end of a transaction.

While there are many interesting problems to solve down the road (such as parsing the contextual keyword atomic, and handling early exits in the block), currently my biggest hurdle is figuring out how to insert any statement or expression into the AST.

It seems like I can’t just create a Stmt- or Expr-subtyped instance via its constructors. These types have some functions that can create statements and expressions, but those functions require parameters that I’m not able to find the correct values for, such as ASTContext, or SourceLocation. And lastly, I don’t know how a statement can be added to the StmtResult aka ActionResult<Stmt *>-typed return values of Parser::Parse*Statement functions.

I would really appreciate it if anyone could point me in the right direction, or to resources (such as tutorials, documentation, or anything similar done in the past) that would help me better understand how to do this kind of transformation in the AST.

Hi, and welcome! I’m not certain if you saw this page in our docs, but it has some high-level information about the AST in Clang: Introduction to the Clang AST — Clang 16.0.0git documentation

Typically, the way we would handle this is to introduce a new AST node to represent the atomic block, and then when doing CodeGen we would lower it to LLVM IR such that it inserts the transaction start/end calls as appropriate. We’d do it this way because that retains the most source fidelity in the AST, which is very useful. For example, that allows us to pretty-print back to the original source code from the AST, support AST pattern matching tools like clang-tidy, etc.
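To make the source-fidelity point concrete, here is a minimal toy sketch. It does not use Clang’s real classes: ToyAtomicDoStmt, prettyPrint() and lower() are hypothetical names invented for illustration, and statements are modelled as plain strings. The idea it shows is that the node only records what the user wrote, while the transactionStart()/transactionEnd() calls appear in the lowering step alone.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only; these are NOT Clang's AST classes.
// The node stores what appeared in the source and nothing else.
struct ToyAtomicDoStmt {
  std::vector<std::string> Body; // one entry per statement, as source text
};

// Pretty-printing reproduces the original construct, because the node
// still knows it was an "atomic do" block.
std::string prettyPrint(const ToyAtomicDoStmt &S) {
  std::string Out = "atomic do {\n";
  for (const std::string &Line : S.Body)
    Out += "  " + Line + "\n";
  Out += "}\n";
  return Out;
}

// Lowering is the only place where the start/end calls are introduced.
// (Hypothetical textual output; real CodeGen would emit LLVM IR.)
std::string lower(const ToyAtomicDoStmt &S) {
  std::string Out = "{\n  transactionStart();\n";
  for (const std::string &Line : S.Body)
    Out += "  " + Line + "\n";
  Out += "  transactionEnd();\n}\n";
  return Out;
}
```

Had the parser rewritten the block into the second form directly, prettyPrint() would have nothing left to recover the original from, and a tool looking for atomic blocks would have to guess from the inserted calls.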

You are correct that we hide most of the constructors for AST nodes and instead use Create() static functions. The ASTContext is basically the “owner” for all of the AST nodes in the compilation job; there’s generally only one instance of it and it’s created quite early in the compilation process. Many objects like Sema will expose a getASTContext() function or a Context data member to hold a reference to the ASTContext. SourceLocations are something you get out of the lexer; typically you’ll get a Token object while parsing and can use Token::getLocation() to get a SourceLocation object representing the location of that token in the source file. As for StmtResult, you don’t have to make any changes there – your new AST node will derive from Stmt anyway.
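As a rough illustration of that ownership pattern, here is a self-contained toy model. None of these types are Clang’s: ToyASTContext, ToyToken, ToySourceLocation, ToyCompoundStmt, ToyAtomicDoStmt and adopt() are all hypothetical stand-ins, and Clang’s real allocation scheme differs. The sketch only shows the shape: constructors are private, a static Create() takes the context, the context owns every node, and the location comes from a token.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins, NOT Clang's real classes.
struct ToySourceLocation {
  unsigned Line;
  unsigned Column;
};

struct ToyToken {
  std::string Spelling;
  ToySourceLocation Loc;
  ToySourceLocation getLocation() const { return Loc; }
};

class ToyStmt {
public:
  virtual ~ToyStmt() = default;
};

// The context owns every node; callers only ever hold raw pointers.
class ToyASTContext {
  std::vector<std::unique_ptr<ToyStmt>> Nodes;

public:
  // Takes ownership of Node and returns a non-owning pointer to it.
  template <typename T> T *adopt(std::unique_ptr<T> Node) {
    T *Raw = Node.get();
    Nodes.push_back(std::move(Node));
    return Raw;
  }
  std::size_t numNodes() const { return Nodes.size(); }
};

class ToyCompoundStmt : public ToyStmt {
  ToyCompoundStmt() = default; // private: use Create()

public:
  static ToyCompoundStmt *Create(ToyASTContext &Ctx) {
    return Ctx.adopt(std::unique_ptr<ToyCompoundStmt>(new ToyCompoundStmt()));
  }
};

class ToyAtomicDoStmt : public ToyStmt {
  ToySourceLocation AtomicLoc; // location of the 'atomic' token
  ToyStmt *Body;               // owned by the context, not by this node

  ToyAtomicDoStmt(ToySourceLocation L, ToyStmt *B) : AtomicLoc(L), Body(B) {}

public:
  static ToyAtomicDoStmt *Create(ToyASTContext &Ctx, ToySourceLocation L,
                                 ToyStmt *Body) {
    assert(Body && "an atomic do block needs a body");
    return Ctx.adopt(
        std::unique_ptr<ToyAtomicDoStmt>(new ToyAtomicDoStmt(L, Body)));
  }
  ToySourceLocation getAtomicLoc() const { return AtomicLoc; }
  ToyStmt *getBody() const { return Body; }
};
```

In this model a parser routine would call ToyAtomicDoStmt::Create(Ctx, Tok.getLocation(), Body) and return the resulting pointer. Because the node derives from ToyStmt, nothing about the return type has to change, which mirrors the point about StmtResult above.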

The “Clang” CFE Internals Manual — Clang 16.0.0git documentation has quite a bit of information about this kind of stuff and may also be a good reference for you.

Btw, if you’re thinking of working on transactional memory, you should propose it with an RFC here in Discourse. There was a previous thread gauging community interest (interest in support for Transactional Memory?) that did not get a lot of attention but did raise some questions for things to keep in mind. I would recommend starting a new RFC that lays out why you think support for this is good, what you plan to do, etc. You don’t have to do that just to play around with the implementation work to see how much effort is involved, but you will have to do it by the time you’re ready to submit patches to Clang to land your work.

Thank you for looking into this!

Hi, thanks for the welcome and this detailed response!

I saw this page, although the only things I learnt from it were how much information the AST carries, how to dump the AST, and how to find the parts that interest you. It doesn’t really go into modifying the AST, but it does give a (very) high-level overview of it.

This is a very helpful suggestion. What you wrote makes a lot of sense to me. Since the AST should be a faithful representation of the source code, I shouldn’t insert things into it that are not from the source. I tried creating a new AST node earlier, and it was very straightforward, especially since the atomic do statement does nothing other than wrap a compound statement in the AST.

But at this step, I do have a follow-up question, if you don’t mind: Could you point me to the entry point of CodeGen, and/or some documentation about it? I was only able to find very little information about it online. Both the internal manual and the Hacking on Clang tutorial have only very short summaries of what CodeGen/IRGen does, without any information on how it works or how I might be able to modify it.

Thanks for letting me know about this contribution process. I am not very familiar with it. I don’t have a fully fleshed out plan yet for what exactly I need to do in Clang, so for now I think I’ll just play around with the implementation work to find out what needs to be done by doing it, and hold off on starting an RFC until I think I’m ready.

Thanks again for the helpful response!

I don’t know that we have any more documentation on the design of CodeGen, but at a high level, it works by walking over the AST generated by Sema using an AST consumer. You will need to modify CodeGenFunction to have a new Emit function to emit the code for your new AST node. Most likely you’ll want to call that from CodeGenFunction::EmitStmt() or CodeGenFunction::EmitSimpleStmt(). Based on what you’ve mentioned above, it should mostly be a matter of emitting the correct call to start the transaction, calling EmitCompoundStmt() to emit the contents of the block, and then emitting the correct call to end the transaction.
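The dispatch-and-emit shape described above can be sketched with a small toy model. This is not Clang’s CodeGen: ToyStmt, ToyCodeGenFunction, EmitAtomicDoStmt() and the make* helpers are hypothetical names, and “emitting” here just appends strings to a list instead of building LLVM IR. Only the order of operations is meant to carry over.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy model, NOT Clang's CodeGen or AST.
struct ToyStmt {
  enum Kind { Simple, Compound, AtomicDo } StmtKind;
  std::string Text;              // used by Simple
  std::vector<ToyStmt> Children; // Compound: its statements;
                                 // AtomicDo: exactly one Compound body
};

ToyStmt makeSimple(std::string Text) {
  return ToyStmt{ToyStmt::Simple, std::move(Text), {}};
}

ToyStmt makeCompound(std::vector<ToyStmt> Children) {
  return ToyStmt{ToyStmt::Compound, "", std::move(Children)};
}

ToyStmt makeAtomicDo(ToyStmt Body) {
  ToyStmt S{ToyStmt::AtomicDo, "", {}};
  S.Children.push_back(std::move(Body));
  return S;
}

class ToyCodeGenFunction {
  std::vector<std::string> Out; // stand-in for the emitted instructions

public:
  // Central dispatch: one case per statement kind.
  void EmitStmt(const ToyStmt &S) {
    switch (S.StmtKind) {
    case ToyStmt::Simple:
      Out.push_back(S.Text);
      break;
    case ToyStmt::Compound:
      EmitCompoundStmt(S);
      break;
    case ToyStmt::AtomicDo:
      EmitAtomicDoStmt(S);
      break;
    }
  }

  void EmitCompoundStmt(const ToyStmt &S) {
    for (const ToyStmt &Child : S.Children)
      EmitStmt(Child);
  }

  // The new Emit function: start call, body, end call.
  void EmitAtomicDoStmt(const ToyStmt &S) {
    assert(S.Children.size() == 1 &&
           S.Children[0].StmtKind == ToyStmt::Compound &&
           "atomic do wraps exactly one compound statement");
    Out.push_back("call transactionStart");
    EmitCompoundStmt(S.Children[0]);
    Out.push_back("call transactionEnd");
  }

  const std::vector<std::string> &emitted() const { return Out; }
};
```

This sketch only covers the straight-line case. Early exits from the block, which were mentioned earlier in the thread as a later problem, would bypass the final call here and need separate handling.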