I would like to propose the addition of RSan (Robustness Sanitizer) to the compiler's sanitizers. Currently TSan can miss weak behaviours allowed by the C memory model, because synchronization introduced by the sanitizer itself hides them, and as a result it can miss races. RSan complements TSan by verifying that the program does not diverge from sequential consistency even when using weak atomic accesses. Running the two in conjunction will increase assurance of the program's correctness.
Before I rebase the changes, I would like to have the following input:
Should I rebase my changes on main or another version?
Should the sanitizer be part of TSan with an optional flag? I believe this should be the case since they both augment atomic accesses and we’d like to keep them in sync. In addition there’s little reason not to run TSan when running RSan.
What should be the size of the timestamps? Currently in the patch it is 64 bits for maximum correctness. Since TSan itself uses 16 bits for timestamps, that value seems too high. Why did TSan opt for 16 bits? What would be correct here? 16 bits? 32 bits?
I had a brief look at the paper and patch. The goal is appreciated, but I don’t see how this can be upstreamed without serious rework.
Unconditional Overhead
The patch adds permanent performance overhead. The enable_robustness_reporting flag only gates the reporting call, not the expensive instrumentation work. All the locking and vector clock updates are unconditional. This is a non-starter; any new feature must have zero cost when disabled. Why was this not placed behind a proper feature flag?
Fundamentally Unscalable
The locking design itself is unscalable. The use of a single global locksLock to protect a map of per-address mutexes serializes the instrumentation of all atomic operations, even on unrelated addresses. This design directly explains the 222x worst-case performance regression mentioned in the paper and is not viable for real-world use.
Usability
The tool is overly conservative, flagging benign behaviors (e.g., busy-wait loops) as violations. The proposed solution of requiring manual user annotation shifts a heavy burden onto the developer and requires them to be memory model experts. This seems to undermine the goal of an automated tool. What’s the expected workflow for a developer on a large multi-million line project to triage these violations?
Thank you for the quick reply!
I know the patch isn’t ready to be merged as-is. This is why I wanted to have some discussion before I convert it to a more fitting form.
Unconditional Overhead
I fully agree. My intention is to gate the entire instrumentation behind an enable_robustness flag. The reporting-only flag was used to measure the overhead without the overhead of the prints. The reason the instrumentation was not gated in the paper artefact is that I wanted to compare a clean TSan to TSan+RSan, making sure I didn't touch any code paths in the comparison code.
Additionally, at least at first, I do not want to add support for the value-tracking extension, as it requires annotations from the user. If we could statically identify these patterns, automated support could be added in the future.
Fundamentally Unscalable
From my understanding and profiling, the worst-case performance overhead comes from the insufficiently optimized path for SC fences. I have some ideas on how they could be compiled differently: this would require many more location clocks, but would not require a global lock as they do now. Still, reducing the timestamp size from 64 bits should help significantly (512-bit vectorization removed the overhead).
In either case, the locksLock is held for a very short time, and I didn't perceive it as a source of slowdown in my profiling.
Usability
I think most cases of busy-wait loops could be identified statically, so we can automate value tracking. Without value-tracking support, adding a fence before the loop can mitigate the problem: for any value read we won't report a violation (assuming the proper synchronization path).
Regarding locksLock: it is not required for the correctness of the algorithm. We could use some lock-free data structure, or caching for recently accessed atomics. What is the preferred option for compiler-rt?
This would be useful to have for tricky synchronization algorithms/containers.
Yes, the code must be rebased on top of the main branch.
Yes, I think this should be a TSan mode enabled with an additional flag (but I can say more reliably when I look at the patch; the diff is hard to review/understand). If you just upload it as a PR to an llvm fork (on the older revision, so that you don't need to rebase), that would already be helpful for initial review.
Timestamps should be consistent across TSan code, if the rest uses 16 bits, this part should use 16 bits as well.
If I understand the paper right (memorize previous values for atomic variables, and return non-default values from loads when MM permits), this is part of the Relacy Race Detector back in 2008:
Later I ported it to TSan v1:
But unfortunately it was never ported to v2/v3 runtime.
I have made the following changes from the paper in this version:
A feature flag for robustness verification, disabled by default.
Removed value tracking. I might have missed some references.
Timestamps use 16 bits, like TSan.
There is a slight misunderstanding regarding the value tracking. RSan does not simulate the WMM. Value tracking is used for a more fine-grained understanding of whether divergence between C++ and SC can happen. It's a feature to enable strong CAS along with possible Wait and BCAS primitives.
Wait and BCAS: The program will loop indefinitely until the expected value is found. Thus we can ignore robustness violations on any value but the expected one.
Strong CAS: The program cannot read the expected value (as a CAS failure). Thus we can ignore writes of this value that have HBSC but not HB to the thread.
Okay, then I misunderstood what this is doing, and I'm missing the rationale behind the approach.
Do I understand correctly that what the tool finds is not bugs?
There are useful algorithms that are not sequentially consistent and/or not linearizable.
Instead of proving that an algorithm is SC, and then using tools that can verify only SC algorithms, why not make the tools verify non-SC algorithms?
RSan identifies robustness violations: executions where the WMM diverges from SC. These violations might not be bugs themselves, but they often lead to bugs. Even experts get this wrong, and the bugs surface when the program is ported from x86 to ARM.
While not all useful algorithms are sequentially consistent and/or linearizable, many still are. By making sure a program is robust, we get SC guarantees while using fewer synchronization primitives, allowing for better performance and for optimizing the code without introducing weak behaviour.
The reason to verify via SC rather than verify non-SC behaviour directly is that verifying non-SC is hard. If we verify only the current execution, we are susceptible to the observer effect: the tool introduces synchronization that does not exist in the program, and thus misses execution paths that might contain races [example 1.1]. Even if we manage to avoid the observer effect, testing on stronger hardware (x86-TSO) will miss executions that happen on weaker hardware (ARM, Power) [1].
Current approaches try to address this by simulating the memory model to explore different possible code paths. This is problematic since we either have unbounded memory usage from storing the entire execution graph [example 7.2] or sacrifice completeness and miss possible executions [example 7.1].