Is it possible for CSA to analyze other languages, e.g. Java or JavaScript, by transpiling them to a Clang-compatible AST?

I believe this is feasible for several reasons:

  1. There are already a number of toolchains that provide language-to-language conversion, such as Emscripten which can translate C to JavaScript.
  2. C/C++ is sufficiently universal and low-level, and recent versions of C++ have enough features to make such a translation effort easier.
  3. Lastly, and most importantly, this translation would only be for analysis purposes, and would not be responsible for compiling to the final bitcode to run on the target machine. This makes the conversion simpler and further reduces the effort required for the translation.

I doubt this is a good idea.

First, I am confused by the term “Clang-compatible AST”, since the Clang AST offers no compatibility guarantees at all. Unlike LLVM IR, which is a language in its own right, the Clang AST is simply an internal data structure of Clang, and it is naturally incompatible between different versions of Clang.

Second, I doubt that higher-level languages like Java/JS can simply be translated to the Clang AST. Although they all look similar at first glance, there are a lot of differences, and these would make it hard to proceed. You said we can be tolerant because this is only for analysis, but the effort would be meaningless if the analysis can’t find anything.

I know there are a lot of analysis tools based on LLVM IR. You may not be happy with LLVM IR, since it may lose a lot of higher-level information. If that is the case, I guess you can take a look at ClangIR (GitHub: llvm/clangir, “A new high-level IR for clang”). ClangIR is an MLIR-based IR that aims to analyze programs at a higher level. See the RFC “[RFC] An MLIR based Clang IR (CIR)” for some details. The reason I recommend ClangIR rather than the AST is that ClangIR is a language in its own right, so if it can’t represent something well, you can actually try to improve it. I haven’t followed ClangIR closely; I guess @bcardosolopes can share some thoughts here.

This is definitely theoretically possible but I wouldn’t recommend it.

Being a moderately “aggressive” bug-finding tool, the static analyzer, no matter how sophisticated the technique is, relies on code generally “making sense”, so that it can focus on finding code that “doesn’t make sense”. One example of such behavior is relying on the absence of dead code in the program. E.g., in the toy example

01 int foo(int *x) {
02     if (x == nullptr) { /*...*/ }
03     return *x;
04 }

the code does not make sense because the null check for x co-exists with an unchecked use of x. The static analyzer emits a warning: “Assuming x is null on line 02, there’s a null pointer dereference on line 03”. But another possibility is that the check on line 02 is redundant because nobody ever passes nullptr into the function. This makes the warning a “vacuous truth” at best, but from the static analyzer’s perspective it’s a victory nonetheless, because one way or another, the code doesn’t make sense. Like, if it can’t be null, why check? And if it can be null, why use unchecked?
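The two readings of the toy example that do “make sense” can be sketched as follows. This is only an illustration: the function names and the fallback value 0 are made up here and are not part of the original example.

```cpp
#include <cassert>

// Reading A: null is a legitimate input, so the dereference is guarded too.
// The fallback value 0 is an arbitrary choice made for this sketch.
int foo_nullable(int *x) {
    if (x == nullptr) {
        return 0;
    }
    return *x;
}

// Reading B: nobody ever passes null, so the redundant check is dropped and
// the precondition is stated once, explicitly, with an assertion.
int foo_nonnull(int *x) {
    assert(x != nullptr && "callers must not pass null");
    return *x;
}
```

In either version the check and the use agree with each other, so the contradiction that the warning is built on is gone.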

Now, if you feed the static analyzer machine-generated code, the assumption that “the code typically makes sense” becomes invalid. Machine-generated code often fails to make sense and it’s ok, nobody expects it to make sense to begin with, they simply expect it to have the desired properties. So I suspect that you’ll have a massive amount of false positives. Unless the translator produces code very similar to the original code, or otherwise code of very high quality, the static analyzer would needlessly freak out about every redundant operation or defensive check inserted by the machine to make sure the code behaves correctly in the most ridiculous cases.

On top of that, some information that’s available in the original code will be lost in translation, as higher-level constructs of languages like Java will have to be modeled with lower-level constructs of C/C++. This further disconnects the generated source code from the original programmer’s intent, leading to more code that doesn’t necessarily make sense after translation.

This is not only about the static analyzer understanding the programmer’s intent, but also about the static analyzer being able to explain the bug to the developer in the developer’s own language. The static analyzer doesn’t simply emit one-line warnings; instead it explains execution paths on which problematic things happen, often leading to dozens of notes attached to every warning. If a source-to-source transpiler is used, these notes will need to be translated back to the original language, something that the static analyzer can’t do on its own. Without these notes it would be virtually impossible to understand any of the warnings.

So I think this is going to be bad for the same reason why the static analyzer was built over Clang AST as opposed to, say, LLVM IR. You can think of translation to LLVM IR as if it’s just another source-to-source transpiler, with all the same downsides: original intent often lost in translation, some machine-generated code not necessarily making sense, and explaining the problem in the programmer’s preferred terms becoming extremely cumbersome.

Can you build static analysis directly on top of LLVM IR? Absolutely. It’s probably much easier to achieve as well, given that LLVM IR is much simpler than C++. But it’s going to be a very different beast, much less user-friendly, and often less impactful on overall software quality. But with these drawbacks taken into account, such an LLVM IR-based static analyzer would be a much better candidate for your experiment, as it wouldn’t rely on an intimate connection to the original source code as much as our static analyzer does.

I agree that machine-generated code can contain a lot of snippets that make no sense to humans. As you mentioned, machine-generated code will include lots of defensive code as well as adaptation code for different runtime environments. However, this is all because the translation is targeted at execution. If it is only for the analysis scenario, I believe the translated code will be quite intuitive and make sense to both humans and the static analyzer.
For example, this toy Java snippet:

Person p;
if (condition) {
	p = new Person();
}
p.getAge();

can be translated to C++:

Person *p;
if (condition) {
	p = new Person();
}
p->getAge();

This conversion is equivalent for analysis purposes, and we can use CSA to help find the NPE issue. There is no garbage collector in C++, but we don’t care about that here.

Source-to-source translation may also encounter some of the same problems as LLVM IR does, but I believe they will be much milder.
First, source-to-LLVM IR is a lossy conversion: it discards any information that is not related to execution, which can make certain analyses difficult or impossible and can decrease the readability of warnings. Source-to-source conversion can generally be performed losslessly.
Additionally, targeting an executable IR can cause a significant expansion in code size, which is not an issue in source-to-source conversion.

Certainly, there are still many areas that need to be carefully considered, such as the differences in file/package management, or the fact that dynamic languages can create objects and extend their properties at run time. I think these can be handled by extending the analyzer.

I mean that other languages could be compiled to a specific version of the Clang AST, let’s say the C++ Clang AST, in order to use CSA. Maybe “compatible AST” is not an accurate way to describe that.

You can check my last reply to NoQ. I think a lossy translation, while not suitable for execution, is useful enough for most analysis purposes.

Thanks, but I think CIR is another version of the AST for C/C++, and the source-to-source problem will be the same as with the Clang AST.

It cannot be, though. The original code p.getAge() may throw a null pointer exception, which may be caught by the caller. So the C++ equivalent would need to be, at least, something like

Person *p;
if (condition) {
	p = new Person();
}
if (p != nullptr) {
	p->getAge();
} else {
	throw java::NullPointerException();
}

And just like that, bam, there’s suddenly an extra branch to explore.

Moreover, once you transform code like this, the static analyzer won’t actually be able to warn about null dereference, because the use site becomes guarded by the newly introduced if-statement. And the static analyzer will not necessarily be able to prove that the exception is uncaught (and even if it could, there’s no such check in the analyzer yet).

Another thing a transpiler would definitely do in this example is initialize p with nullptr. And in this case you want that for your purposes as well, otherwise the static analyzer will emit an uninitialized use warning.
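Putting the two points above together (the explicit null check and the nullptr initialization), a self-contained version of the semantics-preserving translation might look like the sketch below. Everything around the original fragment is an assumption made for illustration: `java::NullPointerException` is a hypothetical stand-in rather than a real library type, `Person` is a stub, and the wrapper functions `translated` and `caller`, the returned age, and the sentinel -1 exist only so that the behavior can be observed. Note also that javac itself would reject the original Java snippet unless p is initialized (e.g. to null), since Java forbids reading a local variable that is not definitely assigned.

```cpp
#include <stdexcept>

namespace java {
// Hypothetical stand-in for the Java runtime exception; not a real library.
struct NullPointerException : std::runtime_error {
    NullPointerException() : std::runtime_error("NullPointerException") {}
};
}  // namespace java

// Hypothetical stub for the Person class from the example.
struct Person {
    int getAge() const { return 30; }  // arbitrary value for this sketch
};

// Translation that preserves the Java behavior: p starts out as nullptr and
// the dereference is guarded by a check that throws on null.
int translated(bool condition) {
    Person *p = nullptr;
    if (condition) {
        p = new Person();
    }
    if (p == nullptr) {
        throw java::NullPointerException();
    }
    int age = p->getAge();
    delete p;  // Java would rely on the GC; freed here only to keep the sketch leak-free
    return age;
}

// A caller that catches the exception, as a Java caller might.
int caller(bool condition) {
    try {
        return translated(condition);
    } catch (const java::NullPointerException &) {
        return -1;  // arbitrary sentinel for "the exception was caught"
    }
}
```

With the guard in place, no path reaches `p->getAge()` with a null p, so what remains to be reported is a possibly uncaught exception rather than a null dereference.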

As mentioned above, a transpiler built for the purpose of analysis would not be designed this way, and the checker for Java null pointer dereferences would just need to add a check of the enclosing try-catch context.

the transpiler for the purpose of analysis will not be designed this way

So you plan to make a transpiler that isn’t suitable for anything other than such static analysis, because it almost never produces a correct translation? Just to use it as part of a relatively subpar static analysis tool that’s looking for the wrong set of problems for the language (almost every checker other than null dereference falls into this category - leaks, UAFs, uninitialized values), doesn’t know anything about the original language of the program in the first place, and has ten years’ worth of heuristics built up that are mostly inapplicable to the problem it’s trying to solve today?

I really think this isn’t more productive than making a new static analysis tool specifically for your target language. I also think your chances of overall success are relatively low, and chances of running into fundamental incompatibilities between the languages are pretty high.

It’s definitely possible to construct an intermediate representation suitable for static analysis of different languages (LLVM IR and MLIR being great examples), but neither the C++ language nor the Clang AST is naturally suitable for this, and any attempt to augment them is probably going to be very counter-productive.

I think your best bet is probably to make Clang support Java (at least be able to construct a Clang AST for it), and then implement support for Java-specific AST nodes in the Clang CFG and the Clang static analyzer. That’d eliminate most of my concerns and give you access to the solid foundation our static analyzer is built upon, and let you reuse existing code very smartly without causing cross-language confusion. I think it’s still pretty expensive: you’ll need to have a large, healthy team of full-time compiler engineers willing to maintain such a facility more or less forever, and it’d require a much larger discussion with the rest of the Clang maintainers before you could even start (downstream life is most likely a non-starter for such an effort).

So if you’re committed to building such a tool anyway, I definitely can’t stop you, but I’m happy to share more (negative) thoughts on this subject. If you’re just looking for a “low-hanging fruit” solution, I’m afraid the answer is no, it doesn’t exist, and even if you find some tools to pile up together like this, the results will most likely be very disappointing.

I wouldn’t describe it as another version of the AST for C/C++, but I agree that the source-to-source problems would be similar.