[RFC] Revival of numerical sanitizer

Introduction.

Floating-point computations have several mathematical properties that can produce unexpected results even when the code does nothing obviously wrong. Examples include loss of precision, accumulated rounding errors, and numerical artifacts related to special values (NaNs, infinities).
The RFC D97854 [RFC][nsan] A Floating-point numerical sanitizer (Clement Courbet) introduced a relatively simple compiler-based instrumentation (arXiv:2102.12782, NSan: A Floating-Point Numerical Sanitizer) for automatically tracking precision by shadowing the computations with higher-precision types. The technique has been demonstrated on several examples and tests.
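
Conceptually, the instrumentation shadows every floating-point operation with the same operation carried out in a higher-precision type and compares the two at check points such as return values. Below is a hand-written sketch of the idea only (hypothetical helper name and tolerance); the real tool instruments LLVM IR and calls into compiler-rt instead:

#include <algorithm>
#include <cmath>
#include <cstdio>

// Illustrative only: compute a sum natively in float while maintaining a
// double "shadow", and compare the two at the return-value check point.
float SumWithShadow(const float* v, int n) {
  float sum = 0.0f;         // native computation in the original precision
  double shadow_sum = 0.0;  // shadow computation in a higher-precision type
  for (int i = 0; i < n; ++i) {
    sum += v[i];
    shadow_sum += static_cast<double>(v[i]);
  }
  // At a check point (here, the return value) the two results are compared.
  const double rel_err = std::fabs(static_cast<double>(sum) - shadow_sum) /
                         std::max(std::fabs(shadow_sum), 1e-30);
  if (rel_err > 1e-5) {  // hypothetical tolerance
    std::fprintf(stderr, "inconsistent shadow result: %.8f vs %.8f\n",
                 static_cast<double>(sum), shadow_sum);
  }
  return sum;
}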

Example. Consider the classical summation problem:
#include <cstdint>
#include <random>
#include <vector>
#include <stdio.h>

// Naive summation.
template <typename T>
T NaiveSum(const std::vector<T>& values) {
  T sum = 0;
  for (T v : values) {
    sum += v;
  }
  return sum;
}

// Kahan's summation.
// https://en.wikipedia.org/wiki/Kahan_summation_algorithm
template <typename T>
T KahanSum(const std::vector<T>& values) {
  T sum = 0;
  T c = 0;  // Running compensation for lost low-order bits.
  for (T v : values) {
    T y = v - c;
    T t = sum + y;
    c = (t - sum) - y;
    sum = t;
  }
  return sum;
}

int main() {
  using FLT = float;
  std::vector<FLT> values;
  constexpr int kNumValues = 1000000;
  values.reserve(kNumValues);
  // Using a fixed seed to avoid flakiness.
  constexpr uint32_t kSeed = 0x123456;
  std::mt19937 gen(kSeed);
  std::uniform_real_distribution<FLT> dis(0.0f, 1000.0f);
  for (int i = 0; i < kNumValues; ++i) {
    values.push_back(dis(gen));
  }
  const auto trueSum = KahanSum(values);
  printf("true sum: %.8f\n", trueSum);
  const auto sum = NaiveSum(values);
  printf("sum: %.8f\n", sum);
  return 0;
}
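
Building this with nsan enabled (assuming the driver flag from the patch, along the lines of clang++ -fsanitize=numerical -O2 sum.cc -o sum.exe) and running it produces a report like the following: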
 
WARNING: NumericalStabilitySanitizer: inconsistent shadow results while checking return value
float        precision  (native): dec: 500093664.00000000000000000000  hex: 0x1.dced2e00000000000000p+28
double       precision  (shadow): dec: 500119719.80826514959335327148  hex: 0x1.dcf38a7ceea770000000p+28
shadow truncated to float       : dec: 500119719.80826514959335327148  hex: 0x1.dcf38a7ceea770000000p+28
Relative error: 0.00520991419317334923% (2^9 epsilons)
Absolute error: 0x1.971f3ba9dc0000000000p+14
(814 ULPs == 2.9 digits == 9.7 bits)


    #0 0x55ab59fe0b48 in float NaiveSum<float>(std::vector<float, std::allocator<float> > const&) (llvm-project/build/sum.exe+0x3cb48)
    #1 0x55ab59fdf915 in main (llvm-project/build/sum.exe+0x3b915)
    #2 0x7f0db56216c9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #3 0x7f0db5621784 in __libc_start_main csu/../csu/libc-start.c:360:3
    #4 0x55ab59fad5c0 in _start 

Proposal.

https://github.com/llvm/llvm-project/pull/85916 reintroduces D97854 [RFC][nsan] A Floating-point numerical sanitizer, largely following the practices established by the existing sanitizers. Applications of this tool include testing specialized software that deals with numerical algorithms and debugging numerical computations. Potentially it might also help trace portability issues between different platforms.

Implementation.

The implementation naturally follows the existing framework for sanitizers and includes

  1. LLVM: instrumentation pass
  2. compiler-rt: new helper routines that perform checks / expose shadow values.
  3. Clang: minimal codegen changes to add a function attribute and minimal driver changes.

Potential extensions.

Similarly to ThreadSanitizer, the proposed instrumentation is not specific to C/C++ and can be integrated with other LLVM-based toolchains (e.g. Flang, Swift, Rust).

cc: @vitalybuka , @legrosbuffle, @echristo, @arsenm, @andykaylor, @efriedma-quic


I like the idea of this tool. It doesn’t seem quite as thorough as the other sanitizers, but that’s at most just a matter of positioning and setting user expectations.

I haven’t reviewed the code from D97854 in detail, but I did read through the paper. I have a few questions about the optimization and synchronization of the shadow calculations with the original IR.

In general, what guarantees do you have that the shadow IR will be optimized in the same way as the original IR? Or does that matter? I know we have broad intentions of preserving the semantics described by the IR, but there are some cases where we do things that could result in differences.

One example that comes to mind is vector reductions. With OpenMP, for example, the user can give up numerical reproducibility for a reduction loop in order to have it vectorized. When you increase the data size, that will change the order in which the reduction operations are performed.
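
To illustrate the reordering issue with a self-contained toy (not OpenMP-specific): the same three floats summed in two association orders already give different results.

#include <cstdio>

int main() {
  const float a = 1.0f, b = 1.0e8f, c = -1.0e8f;
  const float left = (a + b) + c;   // 1.0e8f absorbs the 1.0f; result is 0.0f
  const float right = a + (b + c);  // result is 1.0f
  std::printf("left: %g  right: %g\n", left, right);
  return 0;
}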

Something similar can happen with peel and remainder loops that might change based on data alignment. I’m not sure the LLVM vectorizer does this, and again it depends on the user allowing numerical differences with vectorized code, but if you’re using something like SVML the vectorized versions of function calls can return different results than the scalar calls.

There was some discussion in the paper about handling direct bit manipulation, but I’m not sure I followed it. This is an area where the LLVM optimizer has been enhanced recently. For example, it can now recognize bit-manipulation fabs idioms and convert them to fabs intrinsics. Does your sanitizer accommodate this in some way?
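
For concreteness, the kind of idiom I mean is something like the following (an illustrative sketch, not taken from any particular library):

#include <cstdint>
#include <cstring>

// Clear the sign bit through an integer view of the float's bits.
float fabs_via_bits(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0x7fffffffu;  // drop the sign bit
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}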

Do you intend to make any allowance for denormal flushing? I expect this would be a significant difference between the lower and higher precision implementations. Is that the kind of thing you intend for your sanitizer to detect? Or could the lack of flushing in the shadow calculations lead to incorrect conclusions?

Finally, I mentioned this in a comment in the original code review, but I’ll reiterate it here. I would love to see this kind of technology extended to help detect numeric divergence introduced by fast-math optimizations. The general idea is that you would enable fast-math everywhere for a program, and then, when it inevitably produces unsatisfactory results, you’d run the tool to find out where the problems were being introduced by fast-math, disable it locally, and try again. I think you could do something similar to what you’re proposing with nsan, except using the same precision for the shadow computation but without the fast-math flags set. Do you think such an extension would be possible (assuming, of course, that the initial implementation lands successfully)?


I’ll take a quick look at this patch today, but probably more focused on the style.

There has been significant offline interest in this revival.
I have chatted with @ashaposhnikov, who mentioned that several companies and organizations have expressed enthusiasm.
I personally mentioned this effort last December to a few folks working on floating-point arithmetic for C library functions,
and they’re interested in the feature. While they support the idea, building llvm-project isn’t familiar territory for them, and they’d prefer having the code in the tree for further suggestions.

However, there’s room for improvement in the process.
Ideally, future RFCs could be posted sooner to generate early discussion. While this revival builds on @legrosbuffle’s well-supported earlier work (D97854 [RFC][nsan] A Floating-point numerical sanitizer), the delay highlights the need for a streamlined process for future revivals.
To better gauge industry interest and facilitate collaboration, the Discourse post is a good place to explicitly record support from companies and organizations during the RFC process.

In light of recent proposals involving instrumentation and compiler-rt, we’re refining the RFC process to establish clearer acceptance criteria.

@andykaylor - thanks a lot for the interest and sorry about the delay.

  1. [optimizations] Similarly to other instrumentation passes, nsan runs very late in the pipeline, in particular after the main middle-end optimizations.
    It might still interact with or affect what runs later (e.g. MIR-level optimizations), but in practice the current placement seems to be acceptable and preserves much of the structure.

  2. [direct bit manipulations] nsan uses two extra regions of memory: the first one (R1) maintains information about the types stored in the application’s memory, and the second one (R2) is the “shadow” memory itself. The relevant section in the paper is Section 3.2.3. For example, if we have a 1-byte store to an address somewhere in the middle of an f64, this updates the corresponding byte in R1, and effectively nsan will consider the shadow value invalid. The general “fallback” approach: if we’ve lost a shadow value, we can restart by “extending” the current native one (see the sketch after this list). In practice this happens rarely, though.

  3. [denormal flushing] The comparisons are tolerance-based. Generally, besides denormal flushing there are quite a few things that affect precision (e.g. some floating-point instructions, such as rsqrt, may return different results on different platforms). So it’s a hard question; there are many things at play, and ultimately the conclusions are made by the user and require understanding of what the application code is doing or is intended to do. Speaking about simpler / more obvious use cases, the tool can help trace/localize the source of NaNs.

  4. [fast-math/openmp] fast-math is an umbrella flag with several consequences. I don’t want to overpromise here, and I need to acknowledge that there might be different opinions. Generally, if we think about changing the order of reductions and related transformations (whether done at the source level or by the compiler), then yes, the tool can help automatically track the precision.
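
Regarding item 2, here is a conceptual sketch of the per-byte type tracking and the fallback (hypothetical types and names; this is not the actual runtime layout or API):

#include <cstdint>

// R1 tracks what was stored at each application byte; R2 holds the
// higher-precision shadow value.
enum class ByteTag : uint8_t { kUnknown, kPartOfDouble };

struct DoubleShadow {
  ByteTag tags[sizeof(double)];  // R1: per-byte type information
  long double value;             // R2: shadow value in higher precision
};

// On a load, if any byte tag is inconsistent (e.g. after a 1-byte store into
// the middle of the double), the shadow is invalid and is restarted by
// "extending" the current native value.
inline long double LoadShadow(const DoubleShadow& s, double native) {
  for (ByteTag t : s.tags) {
    if (t != ByteTag::kPartOfDouble) {
      return static_cast<long double>(native);  // fallback: re-extend
    }
  }
  return s.value;
}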

I wasn’t talking specifically about manipulation of memory values. Maybe that’s why that section of the paper confused me. I’m talking about directly manipulating the bits of a “live” value by casting it to an integer. Here’s a snippet from a case I was working with recently.

https://godbolt.org/z/zfx5rdada

This code is trying to return zero with the sign of the input if denormals are being treated as zero by the processor. It’s part of something a math library was doing to handle the DAZ/FTZ case correctly. This would typically appear in the midst of code that was otherwise using standard floating-point operations on the value.
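
For readers who don’t follow the link, the pattern in question looks roughly like this (an illustrative reconstruction, not the exact code from the godbolt snippet):

#include <cstdint>
#include <cstring>

// Return a zero carrying the sign of x by peeking at the sign bit through an
// integer view of the live float value.
float zero_with_sign_of(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0x80000000u;  // keep only the sign bit
  float result;
  std::memcpy(&result, &bits, sizeof(result));
  return result;        // +0.0f or -0.0f
}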

Maybe this is a corner case that you can just document as unsupported. I see this kind of bit-peeking a lot in math library code, but in the kind of algorithms you intend to sanitize it might not be common.