Patches inspired by the Juliet benchmark

I had the pleasure of looking at some of the false negatives (FNs) on the Juliet benchmark for a week.
I focused on the following CWEs in this order:

  1. CWE-787: Out-of-bounds Write (Juliet cases CWE-123, CWE-124, CWE-121, CWE-122)
  2. CWE-78: OS Command Injection (Juliet cases CWE-78)
  3. CWE-125: Out-of-bounds Read (Juliet cases CWE-126)
  4. CWE-22: Path Traversal (Juliet cases CWE-23, CWE-36)

While looking at the cases, I noticed some repeating patterns and found a few easy fixes, along with some hacky, debatable workarounds for catching those FNs.
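For context, Juliet test cases pair a flawed “bad” function with a fixed “good” counterpart exercising the same data flow. Below is a minimal sketch in the spirit of a CWE-787 case; the function bodies and string contents are my own illustration, not copied from the suite:

```cpp
#include <cstring>

// "bad": the flawed variant that a Juliet case pairs with a fixed one.
void bad() {
  char data[10];
  const char *source = "This string does not fit";  // longer than data
  std::strcpy(data, source);  // out-of-bounds write (CWE-787 / CWE-121)
}

// "good": the same flow, but with a bounded copy.
void good() {
  char data[10];
  const char *source = "This string does not fit";
  std::strncpy(data, source, sizeof(data) - 1);
  data[sizeof(data) - 1] = '\0';
}
```

Patterns of this shape are the territory of checkers like CStringChecker and ArrayBoundsV2, which show up in the commit list below.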

Given that I only spent a single week on this evaluation, I don’t want to share the results just yet, and the title reflects that. Rather, my intention is to announce that I’d like to upstream some of the quick fixes I applied during that week, so that everyone can benefit from them.

I have 13 commits in total, related to:

  • CStringChecker (2)
  • ArrayBoundsV2 (3)
  • GenericTaintChecker (8)
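To give a flavour of the taint-related part: GenericTaintChecker is about following attacker-controlled data from a source to a dangerous sink. Here is a hedged sketch of a CWE-78-style flow (the environment-variable source and the command string are illustrative, not taken from Juliet or from the checker’s tests):

```cpp
#include <cstdlib>
#include <cstring>

void runUserCommand() {
  char cmd[256] = "ls ";
  // Attacker-controlled input acts as the taint source.
  const char *dir = std::getenv("USER_DIR");
  if (!dir)
    return;
  // The tainted string is propagated into the command buffer...
  std::strncat(cmd, dir, sizeof(cmd) - std::strlen(cmd) - 1);
  // ...and reaches a command-execution sink (CWE-78).
  std::system(cmd);
}
```

The point of the sketch is just the shape of the flow: a taint source, some propagation through string manipulation, and a command-execution sink.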

If that’s okay, I’d like to upstream them in two batches, so that I don’t need to split them out of the branch and test them one by one.
The first batch would cover the CStringChecker and ArrayBoundsV2 changes together, and the second batch would cover the taint improvements.


Here is the first batch:


Here is the second batch, for improving taint analysis:


To set expectations: landing these 13 commits won’t improve the FN rate by much; it would take a bit more engineering to cover a significant portion of the FNs. Stay tuned for a follow-up post with the details of what needs to be done to improve further on the Juliet benchmark. No ETA.

I’ll post the evaluation of the first batch here once I’ve backported the patches to our fork and scheduled and evaluated the diff.

Wow, sounds exciting! Do you think we would/should look at those benchmarks more often? Would it make sense to have some scripts in the repo that would make running them straightforward?

I don’t have a harness that invokes CSA natively, and I don’t plan to implement one, so scripting is out of scope for me.
We use CSA as a library, and the invocation is custom to our use case. Because of this, I can’t really publish the diff results: we don’t have the tooling for exporting them out of our ecosystem.
By the way, we use SARIF as the output format and diff the SARIF files directly, but I believe SARIF isn’t commonly used here for sharing analysis results.

Actually, one could probably use scan-build or CodeChecker to get some reports (though I’ve only used the latter). Even then, the findings would need to be posted somewhere, and I don’t have the means for that. What I can say is that a lot of code gets generated, so storage might be an issue for anyone who wants to host it.

Maybe Ericsson folks could give it a try and post the results on their demo server. WDYT @DonatNagyE?

My experience so far is that we are not doing great, let’s put it that way.

See the evaluation of the first batch here. FYI, nothing groundbreaking there.