MemorySanitizer, a tool that finds uninitialized reads and more

Hello llvmdev,

I would like to propose and discuss yet another dynamic tool, which we call MemorySanitizer (msan).
The main goal of the tool is to detect uses of uninitialized memory (the major feature of Valgrind/Memcheck not covered by AddressSanitizer).
It will also find use-after-destruction-but-before-free in C++.

The algorithm of the tool is similar to that of Memcheck (http://static.usenix.org/event/usenix05/tech/general/full_papers/seward/seward.pdf).
We associate a few shadow bits with every byte of the application memory,
poison the shadow of the malloc-ed or alloca-ed memory,
load the shadow bits on every memory read,
propagate the shadow bits through some of the arithmetic instructions (including MOV),
store the shadow bits on every memory write,
report a bug on some other instructions (e.g. JMP) if the associated shadow is poisoned.
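
The steps above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the msan implementation: the names `ToyMemory`, `Tracked`, `load`, `store`, `add` and `branch_would_report` are hypothetical, memory is a small vector rather than the process address space, and the OR-based propagation rule for addition is a simplification that ignores carries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model (hypothetical): one shadow byte per application byte.
// A set shadow bit means "this application bit is uninitialized".
struct ToyMemory {
  std::vector<uint8_t> app;     // application bytes
  std::vector<uint8_t> shadow;  // shadow bytes, same size as app

  explicit ToyMemory(std::size_t n) : app(n, 0), shadow(n, 0) {}

  // What would happen on malloc/alloca: poison the shadow of the new range.
  void poison(std::size_t off, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) shadow[off + i] = 0xFF;
  }
};

// A value held in a "register" together with its shadow.
struct Tracked {
  uint8_t value;
  uint8_t shadow;
};

// Memory read: load the shadow along with the value.
Tracked load(const ToyMemory& m, std::size_t off) {
  return {m.app[off], m.shadow[off]};
}

// Memory write: store the shadow along with the value.
void store(ToyMemory& m, std::size_t off, Tracked t) {
  m.app[off] = t.value;
  m.shadow[off] = t.shadow;
}

// Arithmetic: propagate shadow. OR-ing the operand shadows is a
// simplification; it does not model carries into higher bits.
Tracked add(Tracked a, Tracked b) {
  return {static_cast<uint8_t>(a.value + b.value),
          static_cast<uint8_t>(a.shadow | b.shadow)};
}

// Conditional branch: this is where a report would be issued.
bool branch_would_report(Tracked cond) { return cond.shadow != 0; }
```

In this model, copying or adding poisoned values is silent; only a branch on a value with nonzero shadow counts as a bug, which matches the "propagate, then report on some instructions" split described above.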

But there are differences too.

The first and the major one: compiler instrumentation instead of binary instrumentation.
This gives us much better register allocation (function-wide instead of local),
possible compiler optimizations (static analysis can prove that some accesses always read initialized memory),
and a fast start-up.
Our preliminary measurements show a 3x-4x slowdown; compare it to Memcheck’s 20x and DrMemory’s 10x
(see http://groups.csail.mit.edu/commit/papers/2011/bruening-cgo11-drmemory.pdf for those numbers).
But this brings a major issue as well: msan needs to see all program events, including system calls and reads/writes in system libraries,
so we either need to compile everything with msan or use a binary translation component to instrument pre-built libraries (with DynamoRIO? PIN?).

Question: is there any usable project in LLVM land which performs binary instrumentation (x86->LLVM->x86), either statically or dynamically?

Another difference from Memcheck is that we propose to use 8 shadow bits per byte of application memory and use a
direct shadow mapping (for 64-bit Linux that is just clearing the 46th bit of the application memory address).
This greatly simplifies the instrumentation code and avoids races on shadow updates
(Memcheck is single-threaded so races are not a concern there.
Memcheck uses 2 shadow bits per byte with a slow path storage that uses 8 bits per byte).
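
As a rough sketch of the direct mapping described above: the code below assumes that "46-th bit" means bit index 46 counting from zero (the value 2^46); the thread does not spell this out. The names `kShadowBit` and `app_to_shadow` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the bit being cleared is bit index 46 (zero-based).
constexpr uint64_t kShadowBit = uint64_t{1} << 46;  // 0x400000000000

// Direct 1:1 mapping: the shadow byte for an application byte lives at
// the same address with that one bit cleared.
constexpr uint64_t app_to_shadow(uint64_t app_addr) {
  return app_addr & ~kShadowBit;
}
```

Because every application byte gets its own shadow byte, a one-byte store updates exactly one shadow byte and needs no read-modify-write of bits shared with neighbouring bytes, which is why the 1:1 layout avoids races on shadow updates.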

Suggestions? Objections?
Unless there is a general resentment against msan, we will soon start sending the code for review.
(We already have a somewhat messy implementation, which at the top level looks very much like asan and tsan, and even shares some code with them.
The major difference is that the compiler part is more complicated than in asan/tsan and the run-time part is very simple.)

FAQ:
Q. Why can’t we combine msan and asan?
An addressability checker (like asan) requires little shadow memory, but needs large redzones around allocated objects.
Tools that track uninitialized/tainted data need bit-per-bit shadow in the worst case, but don’t need redzones.
So, if we merge the tools together we multiply the memory overheads.
The instrumentation costs in a combined tool are mostly added to each other (e.g. asan needs to poison redzones and msan needs to propagate shadow through arithmetic insns).

Thanks,

–kcc

Can you make it possible for ASAN to share the same layout? I expect
that both will often be used together...

Joerg

Can you make it possible for ASAN to share the same layout?

Not sure I understand you. What layout?

Shadow memory.

Joerg

Yes, asan and msan shadow could co-exist, at least on 64-bit Linux with ASLR disabled.
But the problem is that the memory overheads will multiply: the combined tool will be more expensive to use
than the two separate tools together.

–kcc

Which is what I am asking about. I don't really have a problem with
using a one-to-one mapping, if it makes both ASAN and MSAN more
efficient in terms of runtime overhead.

Joerg

A one-to-one mapping will make ASAN much less efficient.
I meant that we could have both mappings (1:1 for MSAN and 8:1 for ASAN) in the same process, but that makes little sense to me.

–kcc

Why can't ASAN use the same window as MSAN? That's the part I don't
understand.

Joerg

asan uses an 8:1 mapping, so application memory plus shadow is 9/8 of the application memory.
But the real overhead comes from the heap redzones.
With 32-byte redzones we observe 2x-4x memory bloat, sometimes more.
If asan starts using a 1:1 mapping (as an early version did), this bloat will be multiplied by 2, not by 9/8.
Besides, poisoning and unpoisoning the shadow in asan will become 8x more expensive (this matters most for the stack).
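
The arithmetic behind these numbers can be written out as a small sketch. The function names `shadow_factor` and `total_footprint` are hypothetical, and the model (redzone bloat multiplied by the shadow factor) is an assumption that simply restates the figures quoted above.

```cpp
#include <cassert>

// For an N:1 mapping (N application bytes per shadow byte), application
// memory plus shadow is (1 + 1/N) times the application memory.
constexpr double shadow_factor(double app_bytes_per_shadow_byte) {
  return 1.0 + 1.0 / app_bytes_per_shadow_byte;
}

// Assumed model: redzone bloat and shadow factor multiply.
constexpr double total_footprint(double redzone_bloat,
                                 double app_bytes_per_shadow_byte) {
  return redzone_bloat * shadow_factor(app_bytes_per_shadow_byte);
}
```

Under this model, a 2x redzone bloat costs 2.25x in total with the 8:1 mapping and 4x with a 1:1 mapping; a 4x bloat costs 4.5x versus 8x.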

Why do you worry about this?

–kcc

I’ve just sent a code review request to llvm-commits.

–kcc

Hi again,

MemorySanitizer (msan) is now mature enough to bootstrap LLVM and run it without any additional tools.
Msan has already found one bug in LLVM itself: http://llvm.org/bugs/show_bug.cgi?id=13929

Would anyone be willing to do a code review? (It was sent to llvm-commits: http://permalink.gmane.org/gmane.comp.compilers.llvm.cvs/123253)

Thanks,

–kcc

I recall Chandler saying a few months ago that he was going to review that. Did it just get lost in the shuffle? – John T.