After long talks with lots of people, I think we have a winning
strategy to deal with the variable nature of VMA address in AArch64.
It seems that the best way forward is to try the dynamic calculation
at runtime, evaluate the performance, and then, only if the hit is too
great, think about compile-time alternatives. I'd like to know if
everyone is in agreement, so we could get cracking.
If you're not familiar with the problem, here's a quick rundown:
On most systems, the VMA address (and thus shadow mask and shift
value) is a constant. This produces very efficient code, as the shadow
address computation becomes an immediate shift plus a constant mask.
But AArch64 is different.
In order to execute 32-bit code, the kernel has to use 4k pages, and
that is currently configured with either a 39-bit or a 48-bit VMA. For
64-bit only, 64k pages are set, and you can use either a 42-bit or a
48-bit VMA.
In theory, the kernel could use even more bits and different page
sizes, and systems are free to choose, and have done so different
This means that the VMA value can change depending on the kernel, and
cross-compilation for testing on multiple systems will not work unless
the true value is computed at runtime. But it also means that the
value will have to be stored in a global variable, which will require
additional loads and register shifts at every instrumentation point,
slowing down execution even further.
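
To make the cost difference concrete, here is a minimal sketch of the
two flavours of shadow address computation. The shift value, the mask
layout and all the names (SHADOW_SHIFT, STATIC_VMA_BITS, g_vma_bits,
shadow_static, shadow_dynamic, set_vma_bits) are made up for
illustration and are not the mapping used by any particular sanitizer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values, for illustration only. */
#define SHADOW_SHIFT 3u
#define STATIC_VMA_BITS 39u

/* Static flavour: the VMA is known at compile time, so the mask is an
 * immediate and the whole mapping is an AND plus a shift. */
static uint64_t shadow_static(uint64_t addr) {
    return (addr & ((UINT64_C(1) << STATIC_VMA_BITS) - 1)) >> SHADOW_SHIFT;
}

/* Dynamic flavour: the VMA lives in a global that is set at startup. */
static unsigned g_vma_bits = STATIC_VMA_BITS;

static void set_vma_bits(unsigned bits) {
    assert(bits > 0 && bits < 64); /* keep the shift below well defined */
    g_vma_bits = bits;
}

static uint64_t shadow_dynamic(uint64_t addr) {
    /* Extra work per access: load the global, then build the mask. */
    uint64_t mask = (UINT64_C(1) << g_vma_bits) - 1;
    return (addr & mask) >> SHADOW_SHIFT;
}
```

With the default of 39 bits both functions agree. After a call such as
set_vma_bits(42), only shadow_dynamic() handles addresses above bit 39
correctly, which is the flexibility we would be paying the extra load
and shift for.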
The Current Status
Right now, in order to test it, we made it into a compiler-build
option. As you build Clang/LLVM, you can use a CMake option to set the
VMA, with 39 being the default. We have 39 and 42 buildbots to make
sure all works well, but that's clearly the wrong solution for
anything other than enablement.
With all the sanitizers going in for AArch64, we can now focus on
making a good implementation for the VMA issue, in a way that benefits
both LLVM and GCC, since they have different usages (static vs dynamic
With the build-time option making it static, we have the best
performance we could ever have. This means that any further change
will impact performance, but the changes are necessary, so we just
need to pick the option with the lowest cost and the highest benefit.
The two options we have are:
1. Dynamic VMA: instrument main() to check the VMA value and set a
global value. Instrument each function to load that value into a local
register. Instrument each load/store/malloc/free to check the VMA
based on that register. The compiler may be able to optimise this for
compiler-instrumented code, but not for the library calls.
2. Add a compiler option -mvma=NN that chooses at compile time the VMA
and makes it static in the user code. Compiler-instrumented code keeps
its current performance, but library calls do not, especially for the
dynamic version. This is faster, but it's
also less flexible than option 1, though more flexible than the
Right now, we're planning on implementing the full dynamic VMA and
investigating the performance impact. If it is within an acceptable
range, we just go along with it, and look at the compile-time flag at
a later time, as a further optimisation.
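
As a rough idea of what the startup check could look like, here is a
minimal sketch. It uses one possible heuristic, not a settled design:
it assumes the stack sits near the top of the user address space, so
the highest set bit of a stack address gives the VMA size. All the
names (g_vma_bits, significant_bits, detect_vma, vma_mask) are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical global holding the detected VMA size, in bits. */
static unsigned g_vma_bits;

/* Position of the most significant set bit, counting from 1.
 * Returns 0 when value is 0. */
static unsigned significant_bits(uint64_t value) {
    unsigned bits = 0;
    while (value != 0) {
        ++bits;
        value >>= 1;
    }
    return bits;
}

/* Meant to run once, early in main(). The result is only a guess
 * based on the address of a local variable (assumption: the stack is
 * mapped near the top of the user address space). */
static void detect_vma(void) {
    int probe = 0;
    g_vma_bits = significant_bits((uint64_t)(uintptr_t)&probe);
}

/* Mask that instrumented code would load into a local register. */
static uint64_t vma_mask(void) {
    assert(g_vma_bits > 0 && g_vma_bits < 64);
    return (UINT64_C(1) << g_vma_bits) - 1;
}
```

Each instrumented function would then read g_vma_bits (or the mask)
once and keep it in a register, which is the part whose cost we need
to measure.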
If the impact is too great, we might want to profile and implement -mvma
straight after the dynamic VMA checks. If that's the case, we should
keep *both* implementations, so that users could choose what suits
Either way, I'd like to get the opinion of everybody to make sure I'm
not forgetting anything before we start cracking the problem into an