[RFC] UBSan check data size reductions

Hi all,

I’m proposing two ways to make ubsan instrumented code (really data…) smaller.

Why?
Our platform has limited resources and game teams have to play within
those to create their games. During debugging, they might have some
more relaxed limits, but they still have a budget they need to stick
to. And they tend to use everything they can!
The size of code+data counts towards that budget, and ubsan is very
happy to make code+data a lot bigger when enabled. This can end up
posing some problems and making ubsan harder to use than we would
like.
This might not be as useful for the usual operating systems people
use, but it might help other people with space constraints enable
ubsan more easily.

Biggest space “wasters”:
A quick objdump + grep found that not all check handlers are equal
(checks inserting >10k calls to handler):

clang (OS X):
793539 ___ubsan_handle_type_mismatch_abort
24204 ___ubsan_handle_load_invalid_value_abort
(4127 ___ubsan_handle_builtin_unreachable)

Game:
837785 __ubsan_handle_type_mismatch
20391 __ubsan_handle_load_invalid_value
16037 __ubsan_handle_out_of_bounds
11357 __ubsan_handle_add_overflow

Savings:
First, I made clang emit all the ubsan check data to a different
section (and the type_mismatch data, specifically, to another), to
ease accounting (patch attached: 0001-*.patch, provided so anyone can
check it out. I don’t think it’s useful for it to be in clang).

With an estimate of 48B per type_mismatch (SourceLocation (char* + 2x
u32) + TypeDescriptor ref (uptr) + Alignment (uptr) + unsigned char ==
8+2*4+8+8+1 == 33, which will be upped to 48 (!!) with padding (on
x86-64 we’re aligning all structures to 16B)), we end up with (for
type_mismatch checks’ static data):

Before:
clang: ~47.73 MiB
game: ~59.24 MiB

After 0003-*.patch:
clang: ~31.82 MiB (~67%)
game: ~39.49 MiB (~67%)

(These numbers don’t match 793539*48 bytes because we end up eliding
some checks. The relative savings in the data section will be the
same, though)

I also have a patch for minimizing ubsan source location data
(parametrizable), which would apply to all checks, but with string
merging, we end up not saving that much, in the size of cstring/rodata
sections. This patch allows the user to drop the first N path
components or only keep the last N patch components from the source
locations emitted by clang.

Before:
clang: ~1.28 MiB
game: ~10.75 MiB

After 0002-*.patch:
clang: ~1.19 MiB (~93%)
game: ~8.46 MiB (~79%)

Summing up:
I’m proposing two patches, one saves a lot more than the other in the
cases I’ve seen:
  - Make static check data for type_mismatch 7 bytes smaller. This
check is *by far* the most emitted one. And its static data (per check
handler call) is 48B (with padding). Minimised version is 32B (with
padding). (0003-*.patch)
  - Add -fstrip-path-components-from-checks=N (do bike-shed…), which
tells ubsan to strip the first N path components when emitting
filename information (or only emit the last N, if N is negative).
(0002-*.patch)

The data minimizing patch might not be implementable while keeping
library compatibility (as in: being able to use objects compiled with
an older version of clang on a more recent ubsan library. We might be
able to come up with a heuristic, though), but yields 1/3 savings in
type_mismatch static data (not accounting for file names nor type
descriptors).
The file name minimizing patch is simple enough, and wouldn’t imply
any changes to the sanitizer libraries.

I know these patches are missing tests, this email is an RFC, to know
if there is interest in having this upstream. Proper patches (updated
with comments and tests) will follow depending on the response.

Thank you,

  Filipe

0001-ubsan-cfi-Emit-static-check-data-to-ubsan_check_data.patch (1.64 KB)

0002-ubsan-Add-fstrip-path-components-from-checks.patch (4.36 KB)

0003-ubsan-Minimize-size-of-data-for-type_mismatch.patch (1.08 KB)

I think the 0003 patch makes a lot of sense. I wish I'd done that in the
first place! (You'll need a corresponding change to compiler-rt of course.)

For the 0002 patch, could we use a different representation instead, to
deduplicate the common suffixes? Something like:

  struct Path { Path *parent; const char *content; };

Split each path on each '/' (or some smarter heuristic), and produce:

  Path p0 { nullptr, "/path/to/build/" };
  Path p1 { &p0, "foo/" };
  Path p2 { &p1, "bar/" };
  Path foo_bar_baz_h { &p2, "baz.h" };
  Path foo_bar_quux_h { &p2, "quux.h" };

The downside of this is increased relocation size, which is already a
significant problem for ubsan binaries. I'm not sure how much of that we
can fix without teaching the linker about ubsan data.

The 0001 patch makes me wonder if we can move (most of) the ubsan data into
a section that's not mapped into the program's address space on startup,
and somehow lazily read it from the binary if a check fails. This will need
some work to cope with our duplicate suppression mechanism (which writes
back to the ubsan data), but if you're having address space issues, this
could help substantially.

I think the 0003 patch makes a lot of sense. I wish I'd done that in the
first place! (You'll need a corresponding change to compiler-rt of course.)

For the 0002 patch, could we use a different representation instead, to
deduplicate the common suffixes?

Err, common prefixes.

Hi Richard,

I think the 0003 patch makes a lot of sense. I wish I'd done that in the
first place! (You'll need a corresponding change to compiler-rt of course.)

For the 0002 patch, could we use a different representation instead, to
deduplicate the common suffixes? Something like:

  struct Path { Path *parent; const char *content; };

Split each path on each '/' (or some smarter heuristic), and produce:

  Path p0 { nullptr, "/path/to/build/" };
  Path p1 { &p0, "foo/" };
  Path p2 { &p1, "bar/" };
  Path foo_bar_baz_h { &p2, "baz.h" };
  Path foo_bar_quux_h { &p2, "quux.h" };

The downside of this is increased relocation size, which is already a
significant problem for ubsan binaries. I'm not sure how much of that we can
fix without teaching the linker about ubsan data.

I'm not sure it's worth it to add so much complexity when the maximum
savings of my filename patch (keeping only the last path component) is
between 7% and 20% of the rodata section.
It's also not possible to figure out a proper path component split
from inside one of clang's invocations, since we wouldn't know what
paths the program would have. It would have to be dealt with by having
clang tag the strings + an llvm pass to optimize them in a way we knew
UBSan understood (implying LTO is needed, btw).

If we were to do this per translation unit, the savings would most
likely be even smaller (or be worse than just having the whole file
names. e.g: Support/APFloat.cpp, Transforms/IPO/ConstantMerge.cpp
(both in llvm), and CGExpr.cpp (random files I tried this on) only
have checks in themselves, not in any header. So the only path emitted
is the one file's path).

Each filename is emitted once per translation unit anyway (or gets
merged to one instance per TU later), hence the not-so-big (but still
nice to have!) savings with the 0002-* patch.
Any asserts you might have might already be emitting the full pathname
anyway, too.

A quick check on clang (this was with a quick script, I might have
messed up) tells me that the Path complexity could probably halve (at
most. I got ~54% space usage) the string data for this (a pointer pair
per "split" in patch components (common prefix, different suffix)),
while stripping every path component but the last would make the
string data for filenames use ~27% of the space needed for the full
file names.

This is assuming the full path is only used by UBSan. If it's not, it
gets merged on the current solution, but gets added again on the Path
structure solution.

The 0001 patch makes me wonder if we can move (most of) the ubsan data into
a section that's not mapped into the program's address space on startup, and
somehow lazily read it from the binary if a check fails. This will need some
work to cope with our duplicate suppression mechanism (which writes back to
the ubsan data), but if you're having address space issues, this could help
substantially.

I'm unsure that would be a usable case for people that have these limitations.
In some systems, it might be impossible (for UBSan alone) to get the
path to the executable, either because that information isn't
available by default, or because we don't have the image in a
filesystem.
We'd also be mapping/unmapping file areas a lot (in systems that can),
which would be very noticeable.

I think I will go ahead with the 0003 patch and upload a proper
version of it soon (and the corresponding compiler-rt patch. That one
is done already :slight_smile: ).
As for the 0002 patch, I think I'll upload it too and we can keep the
"alternate version" discussion either here or in the other patch.

I'll probably only upload them tomorrow so feel free to comment here
instead of waiting for me to upload them if you have any comments.

Thank you,

  Filipe