SSA form in Clang

For my project (a subset of C) I'm trying to do optimizations at the
source-language level, and I'm wondering why Clang doesn't have an
SSA-form representation.

Is it not useful, since LLVM does almost all the optimizations?

Someone will come along and correct me if I’m wrong, but my understanding is that Clang does indeed rely very heavily on LLVM to produce optimised code, and very little, if any, optimisation is done in Clang. Clang produces LLVM IR directly from the AST (or at least “nearly directly”: it does some transformations in Sema, for example, which turn implicit casts in the original source into explicit casts - such as converting integers to float or float to integer in arguments or assignments).
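As a small illustration of the implicit conversions mentioned above, here is a minimal sketch in C. The function names are made up for the example. As far as I know, Clang represents each of these conversions as an ImplicitCastExpr node in the AST (with cast kinds such as IntegralToFloating or FloatingToIntegral), and you can inspect them with `clang -Xclang -ast-dump -fsyntax-only file.c`.

```c
/* Hypothetical helpers, only here to show where implicit conversions occur. */

/* The int literal 2 is converted to double before the division. */
double half(double x)
{
    return x / 2;
}

/* Assignment: the double value is converted (truncated) to int. */
int truncate_to_int(double d)
{
    int i = d;
    return i;
}

/* Argument passing: the int n is converted to double for the call. */
double call_with_int(int n)
{
    return half(n);
}
```

None of these conversions is written in the source, but each one has to be explicit by the time IR is generated.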

This is kind of the beauty of LLVM IR, and certainly helps the approach I’ve taken in my Pascal compiler - of course, you still need to produce “sensible” LLVM IR, as the optimisation will not remove all forms of useless/stupid code that the user could create (as I’ve found out at times).

I have a hard time understanding why you couldn’t use LLVM IR to make the optimisations you want (by adding suitable passes to the compiler) - or was the point to output C source code that is “optimised”? If so, you may want to work at the AST level in Clang instead - there are tools for (recursively) iterating over that too.

You may want to copy the list on your replies (Reply All or similar), as I’m far from an expert on how analysis at the AST level would work.

As the saying goes, “The devil is in the detail”, so a large part of the actual answer to your question depends on what optimisations you plan to do. I know there are people who work on, or have worked on, “source to source” conversions, and (parts of) Clang can absolutely be used for this purpose. Whether anyone has already done something that you can use, and whether it does exactly what you want or something similar, is a further question that I haven’t got the answer to.

I am wondering why you would want to do that?
Source code should be optimized for readability and maintainability by
humans, and not necessarily optimized for performance.
LLVM does performance optimizations that would make the resulting C
source code much less readable.

Something called "re-factoring" is normally a useful thing to do at
the source code level, resulting in source code that is easier to
maintain.
Another area might be analyzing the source code and recognising areas
that can be multi-threaded, thus improving performance that way.
A further one is analyzing the source code to identify areas likely to
cause deadlocks.
Essentially, these are useful checks that a compiler does not do.

Kind Regards

James

I get it and agree. However, there is valuable information to be had
at the source level. I realized that after looking through all the
code that goes into identifying loops in LLVM IR. We already know
we're in a loop at the source code level, so I fail to understand why
more work isn't done at the Clang level.

Though the things you have identified are also (probably more) useful things to do.

It’s not quite so clear cut. For example, consider the following:

do { … } while (0);

Looks like a loop in the AST, but clearly isn’t in the IR. This is a common idiom in code that lives in macros.
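For anyone who hasn't met the idiom, here is a minimal sketch of why macros use it. The macro and function names are invented for the example. Wrapping the body in do { ... } while (0) makes the expansion a single statement, so it can sit in an if/else without braces; the "loop" always runs exactly once.

```c
/* Hypothetical macro: swaps two ints. The do/while(0) wrapper turns the
 * three statements into one statement that still needs a trailing ';'. */
#define SWAP_INT(a, b) do { int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Puts the smaller value in *lo. Returns 1 if a swap happened, else 0. */
int order_pair(int *lo, int *hi)
{
    if (*lo > *hi)
        SWAP_INT(*lo, *hi);   /* a bare { ... } block here would break the else */
    else
        return 0;             /* already ordered */
    return 1;
}
```

In the AST the macro body is a do statement, yet there is no back edge in the control flow, so there is nothing for a loop optimisation to work on.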

Or what about:

retry:
  …
  if (fail) goto retry;

Doesn’t look like a loop in the AST, but clearly is in the IR. Loop nests are also often quite different in structure after inlining. Part of the point of doing this in the IR, rather than in a higher-level representation, is that it allows for better canonicalisation. For example, you don’t need to special case all of the variations on:

while(1)
for(;;)
do {} while(1)

You also can detect increment conditions much more accurately, whether or not the programmer has put them in the increment part of a for loop.

Programmers tend to be surprised if switching between a for loop and a while loop impacts code generation. Doing the loop optimisations after canonicalisation makes this much less likely to happen.
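To make that concrete, here is a minimal sketch of one computation written three ways. The function names are invented for the example. At the source level these are three different constructs (a for statement, a while(1) with a break, and a label with a goto), and only the first has its increment in the "increment part". In control-flow terms they are all the same loop, and I would expect an optimiser working on the IR to treat them alike, though this sketch only shows that they compute the same result, not what any particular compiler emits.

```c
/* Sum of 0 .. n-1, written as a conventional for loop. */
int sum_for(int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += i;
    return s;
}

/* Same loop as while(1); the exit test and increment live in the body. */
int sum_while(int n)
{
    int s = 0, i = 0;
    while (1) {
        if (i >= n)
            break;
        s += i;
        i++;
    }
    return s;
}

/* Same loop again with a label and goto; no loop statement in the AST. */
int sum_goto(int n)
{
    int s = 0, i = 0;
again:
    if (i < n) {
        s += i;
        i++;
        goto again;
    }
    return s;
}
```

A source-level loop analysis would need a separate case for each form (and would probably miss the third); an analysis on the control-flow graph needs only one.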

David

I would add to this that "not all programs are written in C" (or something
resembling C). LLVM is used as a backend for a few other languages too, and
having one set of optimisations that works for "everything" is much better
than working in the early phase of translation, where the program still
resembles the source and you'd have to special-case things for each
language (or language construct).

See
http://en.wikipedia.org/wiki/LLVM#Front_ends:_programming_language_support
for a list of languages that use LLVM for code generation.

Working on common functionality helps everyone, regardless of what language
it is.

If you do not necessarily need to use Clang you might want to consider the ROSE compiler infrastructure [1] as it is a framework explicitly designed for doing source-to-source transformations.
It brings pretty much all the different analyses one would expect to be available in a compiler.
Also its C support should be pretty stable.

Cheers,
JP

[1] www.rosecompiler.org