[ubsan] Add -fsanitize-warn-once, only emit runtime error once per check

This flag causes clang to emit a byte for each check; the runtime uses it
to track whether we've already printed an error for that check.

Often, failed checks are triggered many times dynamically, but a user is
typically only interested in which checks failed (with example dynamic
values to aid debugging). This flag makes such runs much more efficient
and their output much more manageable.
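
For illustration, a rough model of the scheme (names here are
placeholders, not the actual handler signatures from the patch):

  static bool SiteAlreadyReported;  // one such byte emitted per check site

  // Called by instrumented code when a check fails; reports at most once.
  static void handleCheckFailure(bool *Checked /*, check data... */) {
    if (Checked) {
      if (*Checked)
        return;        // this check already produced a report
      *Checked = true; // remember that we reported it
    }
    // ... format and print the diagnostic ...
  }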

0001-ubsan-Add-fsanitize-warn-once-only-emit-runtime-erro.patch (16.9 KB)

0001-ubsan-Support-fsanitize-warn-once-only-warn-first-ti.patch (16.5 KB)

slows.pdf (5.44 KB)

Whoops, missed the ubsan commit yesterday so those don't apply cleanly.

Updated patches attached!

0001-ubsan-Add-fsanitize-warn-once-only-emit-runtime-erro.patch (15.6 KB)

0001-ubsan-Support-fsanitize-warn-once-only-warn-first-ti.patch (17.7 KB)

Hi Will,

+  if (Checked) {
+    if (*Checked) return;
+    *Checked = true;
+  }

Does it make sense to do the store atomically? The user's program is
already buggy, but introducing a possible data race is unfortunate.

Dmitri

Hi Dmitri,

Glad you brought this up. I wasn't sure which way to go on this and
erred on the side of simplicity. Attached is an updated compiler-rt
patch using __sync_val_compare_and_swap, which also simplifies the code
a bit. If this builtin is sufficiently portable (supported across
architectures and recognized by compilers), then I would prefer this
approach for the reasons you mention.
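
With the CAS, the guard reduces to something like this (a sketch of the
idea, not the exact patch):

  // Returns the previous value; if it was already true, someone beat us
  // to reporting this check (possibly another thread).
  if (Checked && __sync_val_compare_and_swap(Checked, false, true))
    return;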

Thanks!

0001-ubsan-Add-fsanitize-warn-once-only-emit-runtime-erro.patch (15.6 KB)

0001-ubsan-Support-fsanitize-warn-once-only-warn-first-ti.patch (17.7 KB)

Have you measured the code size overhead from this extra flag? Did you
consider implementing this in the runtime library instead (by
suppressing duplicates based on return address or SourceLocation)?


+1. ThreadSanitizer has to solve the same problem - we want to report
each data race (pair of stack traces) exactly once. The TSan runtime
stores the stacks of printed reports (as sequences of PCs) to do this
de-duplication.
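
The idea, modeled loosely (the real runtime avoids the STL and uses its
own hashtable; this is just an illustration):

  #include <cstdint>
  #include <set>
  #include <vector>

  static std::set<std::vector<uintptr_t>> ReportedStacks;

  // Returns true only the first time a given stack of PCs is seen.
  static bool IsNewReport(const std::vector<uintptr_t> &StackPCs) {
    return ReportedStacks.insert(StackPCs).second;
  }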

> Have you measured the code size overhead from this extra flag? Did you
> consider implementing this in the runtime library instead (by
> suppressing duplicates based on return address or SourceLocation)?

Hmm, good point. I hadn't previously measured this, but I just did; see
attached. Sizes are given for text/data/bss/total as reported by
'size', and ubsan calls are counted by occurrences of calls in the
resulting binary (by grepping the output of objdump -d).

Summary: the average increase across CINT2006 was 1.85% using
-fsanitize=integer, with a neatly consistent ~5 bytes per check
(including text).

For the ability to scale even when many sites trigger many times, this
does seem like a valid trade-off.

Previously, IOC used a short linear-scan table (~10-20 elements was the
sweet spot, IIRC) with fallback to a larger hashtable to manage
duplicates, but that was always a performance issue. As a useful data
point, a quick spot-check of 403.gcc shows 96 static locations
triggering a total of 3,476,066 times dynamically when processing just
one of the inputs used for the 'ref' input set (166.i). More on this
below.

> +1. ThreadSanitizer has to solve the same problem - we want to report
> each data race (pair of stack traces) exactly once. The TSan runtime
> stores the stacks of printed reports (as sequences of PCs) to do this
> de-duplication.

Great, I didn't realize TSan already solved this problem. That said,
the problem is somewhat different, I think:

* TSan supports differentiating based on stack trace (apparently), but
that seems less interesting for ubsan/integer checks, especially since
we don't print that information :). The byte-per-check approach
doesn't work for stack traces, so that's not really an option for tsan
as-is.
* I would (perhaps erroneously) expect tsan to have many fewer dynamic
invocations than ubsan/integer checks, which might suggest different
trade-offs in the size vs. performance department. Checking a byte a
million times vs. scanning and managing a vector of ~100 items more
than a million times might make the size increase rather preferable,
even if that's not the right decision for tsan (I have no idea what's
right for tsan :)).

Given this, and in light of the attached data, do you buy that this is
indeed the appropriate approach for ubsan/integer checking?

Until I actually gathered this data I wasn't sure, but the 1-2% seems
very much worth it IMHO.

Thoughts, and thanks for helping ensure we do what's right!

~Will

size.pdf (45.7 KB)

> I would (perhaps erroneously) expect tsan to have many fewer dynamic
> invocations than ubsan/integer checks, which might suggest different
> trade-offs in the size vs. performance department. [...]

TSan inserts a call to the runtime library for each function entry/exit
and for each memory operation (load/store). Still, data races don't
happen that often, so we only have to access the hashtable that stores
the stack traces of printed reports on the slow path. We haven't
observed any performance problems here (Dmitry may correct me if I'm
wrong).

I didn't extensively test ubsan on real-world applications, so it's
hard for me to estimate the number of error reports it prints. But I
think we need to count not the number of calls to __ubsan_handle (i.e.
the number of places in code where an error _might_ happen), but the
number of actual unique reports printed by ubsan. If, say, it's at most
10-20, then storing the PCs of all the erroneous instructions and doing
a linear scan before printing another report might be better than
bloating the binary size by 1%.
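
Concretely, something like this, perhaps (illustrative only; the names
and the table size are assumptions, and locking is omitted for brevity):

  #include <cstdint>

  static const int kMaxReports = 32;  // assuming the 10-20 estimate holds
  static uintptr_t ReportedPCs[kMaxReports];
  static int NumReported = 0;

  // PC identifies the failing check, e.g. the handler's return address
  // (__builtin_return_address(0)). Returns true if we should print.
  static bool ShouldReport(uintptr_t PC) {
    for (int I = 0; I < NumReported; ++I)
      if (ReportedPCs[I] == PC)
        return false;  // already printed a report for this location
    if (NumReported < kMaxReports)
      ReportedPCs[NumReported++] = PC;
    return true;
  }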

That said, I think that the de-duplication functionality should
definitely be implemented one way or another, and it should be "on" by
default (and I can't imagine a reason why a user may decide to turn it
off).

> TSan inserts a call to the runtime library for each function entry/exit
> and for each memory operation (load/store). Still, data races don't
> happen that often, so we only have to access the hashtable that stores
> the stack traces of printed reports on the slow path. We haven't
> observed any performance problems here (Dmitry may correct me if I'm
> wrong).

I turn it off only in tests.
There were no requests to turn it off. There were requests to make it
even more aggressive.

Thanks for the discussion, and sorry for the delay in responding. Holiday and all :).

Reply inline.

> Previously, IOC used a short linear-scan table (~10-20 elements was the
> sweet spot, IIRC) with fallback to a larger hashtable to manage
> duplicates, but that was always a performance issue. As a useful data
> point, a quick spot-check of 403.gcc shows 96 static locations
> triggering a total of 3,476,066 times dynamically when processing just
> one of the inputs used for the 'ref' input set (166.i). More on this
> below.

I think the point I was trying to make above got lost in the other
details (and wasn't shown in the table):

Looking at 403.gcc, there were 13,205 calls inserted: 13,205 "possible"
locations for ubsan to fire.
Of these 13,205 calls, 96 triggered dynamically, a total of ~3.5
million times.

The point was that avoiding a linear scan over a list of 96 elements
3.5 million times seems worth a 1% code-size increase.

While I've only gathered these numbers for 403.gcc recently, previous
experience with SPEC CINT2000 suggests this is not uncommon at all (4
of the 8 benchmarks that had any overflows had more than 25 unique
locations trigger many, many times).

> I didn't extensively test ubsan on real-world applications, so it's
> hard for me to estimate the number of error reports it prints. But I
> think we need to count not the number of calls to __ubsan_handle (i.e.
> the number of places in code where an error _might_ happen), but the
> number of actual unique reports printed by ubsan. If, say, it's at most
> 10-20, then storing the PCs of all the erroneous instructions and doing
> a linear scan before printing another report might be better than
> bloating the binary size by 1%.

Hard decision to make. Agreed regarding 10-20 being the threshold;
that's what I found when tuning IOC's runtime previously.

Given the commonly very high invocation count (thousands is common, if
not millions), I'd rather err on the side of a known, minor code-size
increase with predictable performance behavior than optimize for cases
where few checks are invoked a small number of times, scaling poorly
when that's not the case.
Does this seem reasonable?

> That said, I think that the de-duplication functionality should
> definitely be implemented one way or another, and it should be "on" by
> default (and I can't imagine a reason why a user may decide to turn it
> off).

Sounds good to me, and agreed :).

~Will

> Given the commonly very high invocation count (thousands is common, if
> not millions), I'd rather err on the side of a known, minor code-size
> increase with predictable performance behavior than optimize for cases
> where few checks are invoked a small number of times, scaling poorly
> when that's not the case.
> Does this seem reasonable?

Here's what I would suggest: make the static data objects passed to
the ubsan handlers non-constant, and modify the SourceLocation in
place to mark the diagnostic as disabled. This should be possible with
no binary size impact (other than a little more code in the ubsan
runtime).
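
A sketch of that idea (the field names follow ubsan's general shape,
but this is an assumption, not the actual runtime code):

  struct SourceLocation {
    const char *Filename;
    unsigned Line;
    unsigned Column;
  };

  static const unsigned kDisabled = ~0u;

  // Atomically claims the location: returns true exactly once per check,
  // after which the diagnostic is permanently marked disabled in place.
  // (__sync_lock_test_and_set is an atomic exchange on common targets.)
  static bool acquireLocation(SourceLocation *Loc) {
    unsigned Old = __sync_lock_test_and_set(&Loc->Column, kDisabled);
    return Old != kDisabled;
  }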

>> That said, I think that the de-duplication functionality should
>> definitely be implemented one way or another, and it should be "on" by
>> default (and I can't imagine a reason why a user may decide to turn it
>> off).
>
> Sounds good to me, and agreed :).

SGTM too.