Dynamic VMA in Sanitizers for AArch64

Hi folks,

After long talks with lots of people, I think we have a winning
strategy to deal with the variable nature of the VMA size on AArch64.
It seems that the best way forward is to do the calculation
dynamically at runtime, evaluate the performance, and then, only if
the hit is too great, think about compile-time alternatives. I'd like
to know if everyone is in agreement, so we can get cracking.

  The Issues

If you're not familiar with the problem, here's a quick rundown...

On most systems, the VMA size (and thus the shadow offset, mask and
shift values) is a constant. This produces very efficient code, as the
shadow address computation becomes an immediate shift plus a constant
mask. But AArch64 is different.
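
To make the constant case concrete, here is a minimal sketch of the
shadow computation when everything is known at compile time (the names
are mine and purely illustrative; the (addr >> 3) + offset formula and
the 1UL << 36 value are the ones GCC currently uses on AArch64, as
discussed further down this thread):

        #include <stdint.h>

        /* Hypothetical names; the constant matches GCC's current
           AArch64 shadow offset (1UL << 36). */
        static const uintptr_t kShadowOffset = 1UL << 36;

        static inline unsigned char *mem_to_shadow(uintptr_t addr) {
          /* One shift plus one constant: both fold into immediates. */
          return (unsigned char *)((addr >> 3) + kShadowOffset);
        }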

In order to execute 32-bit code, the kernel has to use 4k pages, and
that configuration currently comes with either a 39- or 48-bit VMA.
For 64-bit-only systems, 64k pages are used, with either a 42- or
48-bit VMA. In theory, the kernel could use even more bits and
different page sizes; systems are free to choose, and have already
chosen different values.

What this means is that the VMA size can change depending on the
kernel, and cross-compiling for testing on multiple systems will not
work unless the true value is computed at runtime. But it also means
that the value has to be stored in a global variable, which requires
additional loads and register shifts per instrumented access, and that
can slow down execution even further.

  The Current Status

Right now, in order to test it, we have made it a compiler-build
option. As you build Clang/LLVM, you can use a CMake option to set the
VMA, with 39 being the default. We have 39- and 42-bit buildbots to
make sure everything works, but that's clearly the wrong solution for
anything other than enablement.

With all the sanitizers now landing for AArch64, we can focus on a
good implementation for the VMA issue, in a way that benefits both
LLVM and GCC, since they have different usage models (static vs.
dynamic linkage).

With the build-time option and a static value, we have the best
performance we can ever get. That means any further change will cost
performance, but the change is necessary, so we just need to take the
lowest-cost / highest-benefit option.

  The Options

The two options we have are:

1. Dynamic VMA: instrument main() to check the VMA value and set a
global variable. Instrument each function to load that value into a
local register. Instrument each load/store/malloc/free to check the
shadow based on that register. This can be optimised by the compiler
for compiler-instrumented code, but not for the library calls. (A
rough sketch of what this amounts to follows this list.)

2. Add a compiler option -mvma=NN that chooses the VMA at compile time
and makes it static in the user code. This has the same performance as
the current approach for compiler-instrumented code, but not for
library calls, especially with dynamic linkage. This is faster, but
also less flexible than option 1, though more flexible than the
current implementation.
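
As a rough illustration of option 1 (my sketch only, not the actual
instrumentation; __vma_shadow_offset and the helper are made-up
names), the instrumented code would amount to something like:

        #include <stdlib.h>

        /* Hypothetical global: set once at startup, before main(),
           after probing how many VA bits the kernel provides. */
        unsigned long __vma_shadow_offset;

        void instrumented_load_check(unsigned long addr) {
          /* The compiler would cache the global in a register per
             function; each access then computes the shadow address
             from that register instead of from an immediate. */
          unsigned char shadow =
              *(unsigned char *)((addr >> 3) + __vma_shadow_offset);
          if (shadow)
            abort();  /* stand-in for the real error report */
        }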

  The Plan

Right now, we're planning to implement the full dynamic VMA and
investigate the performance impact. If it is within an acceptable
range, we just go along with it, and look at the compile-time flag
later, as a further optimisation.

If the impact is too great, we might want to profile and implement
-mvma straight after the dynamic VMA work. If that's the case, we
should keep *both* implementations, so that users can choose what
suits them best.

Either way, I'd like to get everybody's opinion to make sure I'm not
forgetting anything before we start working the problem into an
acceptable solution.

cheers,
--renato

Thanks for writing this up Renato.

What you describe below has been the option I've preferred for a while,
so it looks like a good approach to me.

I just wanted to note that on AArch64, having the shadow offset in a
register rather than as an immediate could result in faster execution
rather than slower, as the computation of the shadow address can be
done in a single instruction rather than two that way. Assuming
x(SHADOW_OFFSET) is the register containing the shadow offset:

        add x8, x(SHADOW_OFFSET), x0, lsr #3

instead of

        lsr x8, x0, #3
        orr x8, x8, #0x1000000000

But as you say, it'll need to be measured what the overall performance
effect is of dynamic VMA support.

Thanks,

Kristof

You mean you want a dynamic shadow offset on aarch64 as opposed to
the fixed kAArch64_ShadowOffset64 (1UL << 36) one?
I think this is completely unnecessary; all you need to change is the
libsanitizer internals IMNSHO. All that is needed is to make runtime
decisions during libasan initialization on the memory layout, and also
(because a 39-bit VMA is too small) a dynamic decision on whether to
use the 32-bit or 64-bit allocator. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64435
for details.

  Jakub

Added kcc.

FYI optional dynamic offset would also help in not-so-rare situations when ASan's shadow range is stolen by early constructors in unsanitized libraries.

-Y

And we'll finally be able to run ASan under Valgrind. :)

Jakub makes a good point: are you sure that there is no single shadow
offset value that works for all VMA variants? What exactly breaks when
1<<36 is used on a 42-bit VMA?

Note, in our distros we are shipping a 42-bit VMA and are using a
patch on top of vanilla libsanitizer (with the 1UL << 36 shadow
offset), and I don't remember any bugs reported against this not
working (the testsuite works too). So, assuming a 39-bit VMA works
too, that would show that at least those two settings work. The
question is whether a 48-bit VMA (or however many bits) works too, and
if it does, the next thing is tweaking the library so that it can,
perhaps with some small but still acceptable performance hit, decide
between those at runtime (e.g. kAllocatorSpace/kAllocatorSize could be
turned into non-const variables for aarch64; harder would be to add an
allocator that picks at runtime whether to use the 32-bit or 64-bit
allocator). A sketch of that idea follows the patch below.

--- libsanitizer/asan/asan_allocator.h (revision 219833)
+++ libsanitizer/asan/asan_allocator.h (working copy)
@@ -100,6 +100,10 @@
# if defined(__powerpc64__)
const uptr kAllocatorSpace = 0xa0000000000ULL;
const uptr kAllocatorSize = 0x20000000000ULL; // 2T.
+# elif defined(__aarch64__)
+// Valid only for 42-bit VA
+const uptr kAllocatorSpace = 0x10000000000ULL;
+const uptr kAllocatorSize = 0x10000000000ULL; // 1T.
# else
const uptr kAllocatorSpace = 0x600000000000ULL;
const uptr kAllocatorSize = 0x40000000000ULL; // 4T.
--- libsanitizer/sanitizer_common/sanitizer_platform.h (revision 219833)
+++ libsanitizer/sanitizer_common/sanitizer_platform.h (working copy)
@@ -79,7 +79,7 @@
// For such platforms build this code with -DSANITIZER_CAN_USE_ALLOCATOR64=0 or
// change the definition of SANITIZER_CAN_USE_ALLOCATOR64 here.
#ifndef SANITIZER_CAN_USE_ALLOCATOR64
-# if defined(__aarch64__) || defined(__mips64)
+# if defined(__mips64)
# define SANITIZER_CAN_USE_ALLOCATOR64 0
# else
# define SANITIZER_CAN_USE_ALLOCATOR64 (SANITIZER_WORDSIZE == 64)
@@ -88,10 +88,10 @@

// The range of addresses which can be returned my mmap.
// FIXME: this value should be different on different platforms,
-// e.g. on AArch64 it is most likely (1ULL << 39). Larger values will still work
+// e.g. on AArch64 it is most likely (1ULL << 42). Larger values will still work
// but will consume more memory for TwoLevelByteMap.
#if defined(__aarch64__)
-# define SANITIZER_MMAP_RANGE_SIZE FIRST_32_SECOND_64(1ULL << 32, 1ULL << 39)
+# define SANITIZER_MMAP_RANGE_SIZE FIRST_32_SECOND_64(1ULL << 32, 1ULL << 42)
#else
# define SANITIZER_MMAP_RANGE_SIZE FIRST_32_SECOND_64(1ULL << 32, 1ULL << 47)
#endif
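
For illustration, the non-const idea could look roughly like this (a
sketch under my own naming, not the shipped code; the values mirror
the 42-bit patch above):

        typedef unsigned long uptr;

        /* Hypothetical: chosen during libasan initialization instead
           of hard-coded at build time. */
        static uptr kAllocatorSpace, kAllocatorSize;
        static int use_allocator64;

        static void choose_allocator_layout(int vma_bits) {
          if (vma_bits >= 42) {
            /* Enough address space for the 64-bit allocator. */
            use_allocator64 = 1;
            kAllocatorSpace = 0x10000000000ULL;
            kAllocatorSize  = 0x10000000000ULL;  /* 1T */
          } else {
            /* 39-bit VMA: too small, fall back to the 32-bit
               allocator. */
            use_allocator64 = 0;
          }
        }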

  Jakub

Hi Jakub,

My assumption is based on what I understood from my talks with various
people working on the sanitizers, libraries, GCC, LLVM and the kernel.
Please correct me if I'm wrong.

IIUC, using a larger VMA value on kernels with smaller VMAs works, but
restricts the space that you can use as a shadow region, thus limiting
the size of programs you can run with the sanitizers. Also, the higher
you go, the more levels of indirection you need to access memory, so
using higher VMA values may be more of a performance hit than it
should be. This may not be such a big deal for 42 vs. 39 bits, but new
kernels will come with 48 bits, and there is talk of pushing it up to
56 bits in the near future.
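
For a sense of the numbers (my arithmetic, not a measurement): with
shadow = (addr >> 3) + offset, an N-bit address space needs a shadow
region of 2^(N-3) bytes, all carved out of the same address space the
application wants to use:

        shadow size = 2^N / 8 = 2^(N-3) bytes
        N = 39: 2^36 bytes = 64 GiB
        N = 48: 2^45 bytes = 32 TiB
        N = 56: 2^53 bytes =  8 PiB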

So, there are only three paths we can take:

1. Keep it constant, today at 42, and increase the constant slowly as
kernels appear with higher VMAs. This could slow things down when the
constant is large but the kernel's VMA is low.

2. Create a compiler flag -mvma=NN. This would be as fast as native
when chosen correctly, but could break if set lower than the machine's
VMA.

3. Make it dynamic, so that the VMA value doesn't matter.

I really have no idea what impact the dynamic VMA will have on the
sanitizers, nor what the impact would be if we chose an arbitrarily
large VMA value (say, 56) and ran on the lowest VMA (39). We need to
benchmark these things.

My email was just a summary of my discussions and a look ahead. We
believe we can implement the dynamic VMA with very little impact, but
before going there, we need to understand the impact of using
higher-than-necessary VMA values. This is all part of the
investigation process that we'll start now.

None of our changes so far will have any impact on the GCC port. If
anything, they'll make your life easier, since you no longer need a
patch to move to 42 bits; it's just a compiler flag. That's why I
wanted to involve you in this discussion from now on, since we'll be
taking decisions that *will* affect you, and we need to be sure
they're the right decisions for both the static and dynamic cases. But
we can't take any decision without hard data to go by, and that's why
we'll be investigating the performance compromises of each model.

cheers,
--renato

You are mixing things up. The size of the virtual address space in
bits is one thing, and the chosen ASAN shadow offset is another.
The way ASAN works is that for a normal memory address the
corresponding shadow memory address is (addr >> 3) + shadow_offset
(the 3 is hardwired at least into GCC, not sure about llvm).
Right now, at least in GCC, the aarch64 shadow_offset is the constant
1UL << 36.
All you need to do is the math on how the address space needs to look
for the various VMA sizes, and whether it clashes with where the
kernel will normally try to allocate shared libraries or the stack.
If you have libsanitizer built for a particular VMA size and that
1UL << 36 shadow offset, you can also just ask the library to be
verbose through an env var and have it dump the layout for you
(ASAN_OPTIONS=verbosity=1). You'll get several regions of normal
memory, several regions of shadow memory, and perhaps some gaps
(shadow memory of shadow memory, which isn't really useful). What
matters is whether the normal memory regions in all those layouts
cover the normal locations of the stack, the binaries, and the places
where the kernel maps shared libraries.
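
For example, with an instrumented binary (./a.out here is just a
placeholder name):

        ASAN_OPTIONS=verbosity=1 ./a.out

prints the computed memory layout at startup, before the program runs.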

  Jakub

> You are mixing things up. The size of the virtual address space in
> bits is one thing, and the chosen ASAN shadow offset is another.

Well, the shadow region cannot overlap with the rest of the address
space, and that's why the offset is different for different VMA
values.

> All you need to do is the math on how the address space needs to
> look for the various VMA sizes, and whether it clashes with where
> the kernel will normally try to allocate shared libraries or the
> stack.

That's the part I'm not sure of. From my understanding, the kernel
*may* use memory beyond the stack, and that's largely based on the VMA
setting. Trying to find a value that works *well* for all alternatives
might be too restricting, or so I'm told.

I'd like the opinion of someone who understands the kernel better than
I do, however, to make a more informed decision. I'll be very happy to
be wrong here, and start using a 56-bit VMA for all AArch64 from now
on (as this may be a very likely future).

cheers,
--renato