Augmenting CFG with probabilities

Hi,

I’m interested in augmenting the CFG with probabilities in order to determine the probability distribution of outcomes. A paper on the topic is available here and a broader summary of the field is available here.

In short, the usual symbolic execution is augmented with probabilities by computing how each branch condition depends on the input data, under the assumption that the original input is uniformly distributed.
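To make that concrete, here's a toy sketch of the core idea as I understand it (the paper does this with model counting over the symbolic path conditions rather than by enumerating inputs): for a single uniformly distributed 8-bit input, the probability of taking a branch is just the fraction of input values that satisfy its condition.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical branch condition taken from some program under analysis.
  auto condition = [](uint8_t x) { return x < 10 || x > 250; };

  // Brute-force "counting" over the tiny 8-bit domain: how many of the 256
  // equally likely inputs satisfy the condition?
  unsigned satisfying = 0;
  for (unsigned v = 0; v <= 255; ++v)
    if (condition(static_cast<uint8_t>(v)))
      ++satisfying;

  // P(branch taken) = |satisfying inputs| / |input domain| = 15/256 here.
  std::cout << "P(true branch) = " << satisfying / 256.0 << '\n';
  return 0;
}
```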

I’ve read through the introductory material on this page and so I’m wondering, where/how can I learn more detail about what is currently implemented? I’m pretty naive about what the gap is between what I want to achieve and what’s currently in LLVM.

Initially I would be implementing this for my own research goals, but it does have some cool applications to finding bugs in general. If the wider community is interested in it then I would be happy to contribute it, or even better, work with others on it.

Thanks in advance for any advice or feedback.

Cheers.

Jeremy

As far as I know, the dataflow framework (that you’re linking to) is relatively new, so it’s evolving quickly and the interfaces are somewhat unstable. I don’t know much about its current features, but it is likely that (1) it will gain more and more features as time goes on, (2) you may experience merge conflicts and similar issues, and (3) documentation is very limited, so you’ll have to get the details from the source code or from the committers who are regularly adding improvements.

Disclaimer: I know almost nothing about the dataflow framework because I’m working within the “main” framework of Clang Static Analyzer that performs path-sensitive analysis.

Tangentially related observation: you’d probably need manually specified range (or probability distribution) annotations on the “input” variables within the code that you’re analyzing, because assuming uniform distribution on the “full” range of the input type can be a very poor approximation. (E.g. it’s a natural choice to represent an array index or the size of a memory buffer as a 64-bit size_t argument, but then the uniform distribution would assign >>99.99% probability to huge values that correspond to terabytes of memory.)
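For a rough sense of scale, here's a back-of-the-envelope sketch (the 1 GiB cutoff for a "reasonable" buffer size is just an arbitrary choice for illustration):

```cpp
#include <cstdio>

int main() {
  // Fraction of the 64-bit size_t range that corresponds to buffer sizes of
  // at most 1 GiB, if sizes were distributed uniformly over the whole range.
  const double full_range = 18446744073709551616.0;  // 2^64
  const double one_gib = 1073741824.0;                // 2^30
  std::printf("P(size <= 1 GiB) = %g\n", one_gib / full_range);  // ~5.8e-11
  return 0;
}
```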

I can only comment on the documentation of the Clang Static Analyzer. In short, it’s basically non-existent. One needs to read the code to get the picture.
We could give you a couple of starting points if you decide that you want to invest in path-sensitive analysis and have explicit goals to achieve.

I’m fairly skeptical about assuming a uniform distribution for the inputs.

At first, I thought you were referring to the CFG in LLVM, but now that I’ve seen that you’re referring to the dataflow framework, it must be the Clang CFG. Speaking of which, you should be warned that, for example, exceptions (and thus their handling edges) are not present in that CFG, and we could very well have more serious correctness bugs lurking, as codegen does not use this CFG. The new ClangIR might give us an actually truthful CFG in the long-term future, though - if it ever happens.
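Just to illustrate the kind of thing that's missing (a small contrived example, based on my reading of the current Clang CFG):

```cpp
#include <stdexcept>

int may_throw(int x) {
  if (x < 0)
    throw std::invalid_argument("negative");  // exceptional exit
  return 2 * x;
}

int caller(int x) {
  try {
    return may_throw(x);
  } catch (const std::invalid_argument &) {
    // The control-flow edge reaching this handler is the kind of edge that,
    // per the caveat above, isn't modeled in the Clang CFG today, so a
    // path/probability analysis built on it would silently miss this path.
    return -1;
  }
}
```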

I can only agree with @DonatNagyE on probabilities, but I’d phrase it differently.
It’s frequently the edge cases that devs get wrong: the error handling and the preconditions - think nullptr dereference, division by zero, forgetting to deallocate, or deallocating twice. These paths might be unlikely, but they can very well bring a system down.
The same goes for tainted data, which could be directly or indirectly controlled by malicious actors.
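A contrived sketch of the kind of thing I mean - the path that a uniform model would consider negligible is exactly the one that contains the bug:

```cpp
void process(unsigned len) {
  char *buf = new char[len];
  if (len == 0) {   // probability ~2^-32 under a uniform model of `len`
    delete[] buf;   // oops: the early return after cleanup is missing
  }
  // ... use buf ...
  delete[] buf;     // double delete on the "negligible" len == 0 path
}
```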

Thanks for your replies.

First, let me assuage your concerns about the assumption that inputs are distributed uniformly. It’s not by any means meant to be a realistic assumption, but it is the most useful one for high-level analysis, although I have not seen it well justified in the literature. Still, it’s the assumption used in that paper, Probabilistic Symbolic Execution [1], it’s the same assumption used in the probabilistic analysis of algorithms [2], and it’s the assumption that is most useful for my work.

You’re right that a realistic distribution of something like allocated memory size or pointer address (especially null vs not-null) would not be uniform, and getting a realistic analysis of how frequently paths are taken would require a careful choice of distribution everywhere. It’s simply not the kind of analysis that I’m interested in right now. Having said that, I am not at all against others helping to add that level of configurability. Simply getting it to work with one built-in assumption is the first step. :slight_smile:

So, what’s this about two CFGs? One in LLVM and one in Clang? (I’m not even really sure what the boundary is between LLVM and Clang.) What are the different purposes of the two CFGs?

Thanks.

Jeremy

[1] Jaco Geldenhuys, Matthew B. Dwyer, and Willem Visser. 2012. Probabilistic symbolic execution. In Proceedings of the 2012 International Symposium on Software Testing and Analysis (ISSTA 2012). Association for Computing Machinery, New York, NY, USA, 166–176.

[2] Micha Hofri. 1995. Analysis of algorithms: computational methods and mathematical tools. Oxford University Press, Inc., USA.

Clang is a frontend (lexer, parser, semantic analysis) for C/C++/Objective-C-like languages; it then transforms its representation into something that the optimizer (llvm-opt) and then the machine code generation can consume: LLVM IR.

It’s generally accepted that the closer you are to the user’s source code, the better the diagnostics you can provide. This is why diagnostics are generated in the frontend phase. Due to optimizations (I figure the same applies to debug info, btw), instructions might get fused, hoisted, or otherwise transformed in ways that make it harder to refer back to the original source code and give sensible, actionable diagnostics; this is why LLVM IR is generally said to not be a good fit for (bug-finding) analysis.

This would not prevent us from having a correct CFG in the Clang frontend, but I guess there are historical reasons: you need a CFG at the LLVM IR level anyway, so a CFG in the frontend was never really a concern for codegen purposes - where correctness actually matters.

I have no evidence to back this theory, but this is why I think the Clang CFG is not as mature as the LLVM CFG, and why it never will be.

That said, there is/will be a new abstraction layer called Clang IR, which is an MLIR dialect, supposedly bridging the gap - combining the benefits of both worlds: a convenient layer for analyses that preserves back-references to the relevant AST nodes, so diagnostics stay actionable. In that abstraction, the CFG on top of Clang IR would actually be used for lowering into LLVM IR, thus giving us a hard guarantee of correctness. And this is what I’m excited about.

Thanks for the explanation!

OK, so the symbolic execution is happening in the Clang CFG, but I wonder if that is the best place for this feature. The reason I wonder is that this analysis does not provide diagnostics at the source-code line level; it provides them at the source function (template) level. If the optimizer can simplify a function’s CFG, then in theory the result of this analysis is the same, and that could prove useful for simplifying the required calculations. However, we should be able to do this analysis on a function template without instantiation (albeit requiring some computer algebra to solve). Since the ability to do the latter is more important than the simplification provided by the former, I guess that puts it squarely into the current Clang CFG after all. Does that make sense?
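For example, here's a toy sketch of what I mean by a template-level result (assuming an unsigned T distributed uniformly over its full range):

```cpp
#include <cstdint>

// The branch probability is 16 / 2^(8 * sizeof(T)): 16/256 for uint8_t,
// 16/65536 for uint16_t, and so on. The natural output of the analysis is
// that formula in terms of the template parameter, not a single number.
template <typename T>
bool in_low_window(T x) {
  return x < T{16};
}
```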

I’m excited about this future Clang IR too, it sounds ideal. I’ll have to make do with the current Clang CFG until the future arrives.

Cheers.

Jeremy

OK, back again. Things are looking up, as I have a new job and some people in the company actually work on Clang, so I might get some support from there.

I think the conclusion I came to earlier to put this into the Clang CFG still makes sense.

So what’s the simplest edit-compile-test process for the Clang CFG? Ultimately I want to get source annotations from this feature, so does that have to go via clang-tidy or clangd? Assuming this is just for myself for now, what’s the simplest way to get ‘output’?