Weird msan problem

Hello everybody,

I’ve run into some strange behavior with memory sanitizer that I can’t explain and hope somebody with more knowledge of the implementation would be able to help me out or at least point me into the right direction.

For background, I’m using memory sanitizer to check Julia (julialang.org), which uses (or at least will once I track down a few bugs) MCJIT for the code compilation. So far I have rebuilt the runtime and all dependencies (including LLVM, libcxx, etc.) with memory sanitizer enabled and added the instrumentation pass in the appropriate place in the julia code generator.

I’m now going through the usual bootstrap which basically loads the standard library and compiles it, does inference, etc. This works fine for several hours (this is usually much faster - by which I mean several hundred time - I suspect the issue is with MCJIT having to process a ton more relocations and code and being inefficient at it, but I can’t prove that). That’s not the issue however. Eventually, I get

==17150== WARNING: MemorySanitizer: use-of-uninitialized-value
#0 0x7f417cea3189 in bitvector_any1 /home/kfischer/julia-san/src/support/bitvector.c:177

[ snip ]

Uninitialized value was created by a heap allocation
#0 0x7f41815de543 in __interceptor_malloc /home/kfischer/julia-san/deps/llvm-svn/projects/compiler-rt/lib/msan/msan_interceptors.cc:854
#1 0x7f417cc7d7f1 in alloc_big /home/kfischer/julia-san/src/gc.c:355

[snip]

Now, by going through it in the debugger, I see

(gdb) f 3
#3 0x00007f417cea318a in bitvector_any1 (b=0x60c000607240, b@entry=, offs=0, offs@entry=, nbits=256, nbits@entry=)
at bitvector.c:177
177 if ((b[0] & mask) != 0) return 1;
(gdb) p __msan_print_shadow(&b,8)
ff ff ff ff ff ff ff ff
o: 3f0010a6 o: 80007666

which seems to indicate that the local variable b has uninitialized data. I’m having a hard time believing that though, since if I look at the functions before it, the place where it’s coming from is initialized:

#4 0x00007f41755208a8 in julia_isempty248 ()
#5 0x00007f417c163e3d in jl_apply (f=0x606000984d60, f@entry=, args=0x7fff9132da20, args@entry=, nargs=1,
nargs@entry=) at ./julia.h:1043

(here’s the code of that julia function for reference)

isempty(s::IntSet) =
!s.fill1s && ccall(:bitvector_any1, Uint32, (Ptr{Uint32}, Uint64, Uint64), s.bits, 0, s.limit)==0

Looking at where that value is coming from:

(gdb) f 5
#5 0x00007f417c163e3d in jl_apply (f=0x606000984d60, f@entry=, args=0x7fff9132da20, args@entry=, nargs=1,
nargs@entry=) at ./julia.h:1043
1043 return f->fptr((jl_value_t*)f, args, nargs);

(gdb) p ((jl_array_t*)((void**)args[0])[1])->data
$43 = (void ) 0x60c000607240
(gdb) p __msan_print_shadow(((jl_array_t
)((void**)args[0])[1]),0x30)
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
o: d800496 o: d800496 o: d800496 o: d800496 o: d800496 o: d800496 o: d800496 o: d800496 o: d800496 o: d800496 o: d800496 o: d800496

There are no uninitialized values to be seen anywhere and the b value isn’t touched before that line, so I’m a little stumped.

One note I should make is that I did have to implement TLS support myself in MCJIT for this to work (I’ll upstream the patch soon), so I may have made a mistake, but I haven’t found anything wrong yet. If nothing looks unusual, I’d also appreciate pointers on what to look for in the TLS variables.

Thank you for your help,
Keno

This is really cool. I've not heard of anyone using MSan with MSJIT before.

I assume there are transitions between JITted code and native helper
functions. How are you handling them? Are native functions
MSan-instrumented?
MSan is passing shadow across function calls in TLS slots. Does your
TLS implementation guarantee that accesses to __msan_param_tls from
JITted and from native code map to the same memory?

Yes, both JIT code and the native runtime are instrumented. I am under the impressions that the the C library should guarantee that from the way the relocations are implemented as long as both native and JITed code are on the same thread (but I will verify this and report back).

I have verified that both TLS implementations indeed find the same area of memory. Anything else I could look for?

Also, I was looking at the instrumented LLVM code and I noticed that the LLVM global variable is generated with size 1000, while kMsanParamTlsSizeInWords == 100, could there be some sort of overflow happening?

How is ccall() implemented? If it manually sets up a stack frame, then
it also needs to store argument shadow values in paramtls.

I don't think there is an overflow, unless you have a _lot_ of
arguments in a function call.

I don’t think we’re doing anything unusual. Here’s the relevant snippet from the generated IR for ccall (after instrumentation):

ccall: ; preds = %pass6, %113
%114 = phi i8* [ %110, %113 ], [ %108, %pass6 ]
%115 = inttoptr i64 %17 to i64*
%116 = bitcast i8* %69 to i32*, !dbg !7
%117 = bitcast i8* %114 to i32 (i32*, i64, i64), !dbg !7
store i64 %_msret13, i64
getelementptr inbounds ([1000 x i64]* @__msan_param_tls, i64 0, i64 0), align 8, !dbg !7
store i32 %70, i32* getelementptr inbounds ([1000 x i32]* @__msan_param_origin_tls, i64 0, i64 0), align 4, !dbg !7
store i64 0, i64* inttoptr (i64 add (i64 ptrtoint ([1000 x i64]* @__msan_param_tls to i64), i64 8) to i64*), align 8, !dbg !7
store i32 0, i32* inttoptr (i64 add (i64 ptrtoint ([1000 x i32]* @__msan_param_origin_tls to i64), i64 8) to i32*), align 4, !dbg !7
store i64 0, i64* inttoptr (i64 add (i64 ptrtoint ([1000 x i64]* @__msan_param_tls to i64), i64 16) to i64*), align 8, !dbg !7
store i32 0, i32* inttoptr (i64 add (i64 ptrtoint ([1000 x i32]* @__msan_param_origin_tls to i64), i64 16) to i32*), align 4, !dbg !7
store i32 0, i32* bitcast ([8 x i64]* @__msan_retval_tls to i32*), align 8, !dbg !7
%118 = call i32 %117(i32* %116, i64 %87, i64 %107), !dbg !7, !julia_type !9

The code for ccall looks right. Sounds like you have a very small
range of instructions where an uninitialized value appear. You could
try debugging at asm level. Shadow for b should be passed at offset 0
in __msan_param_tls.

MSan could propagate shadow through arithmetic and even some logic
operations (like select). It could be that b is clean on function
entry, but then something uninitialized gets mixed in.

Also, what is this @entry stuff? It looks like bitvector_any1 has 6
arguments, but your ccall snippet only sets up tls for 3.

The @entry stuff is just a gdb artifact. I’ve been tracking this back a little further, and it seems there’s at least an origin propagation problem:

(gdb) list
281 goto value_to_pointer_error;
282 }
283 else {
284 if (jl_is_cpointer_type(jvt) && jl_tparam0(jvt) == jt) {
285 void ptr = jl_unbox_voidpointer(v);
286 assert(__msan_test_shadow(&ptr,sizeof(void
*)) == -1);
287 return (void*)ptr;
288 }
289 }
290
(gdb) p __msan_print_shadow(v,16)
00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff
o: 40000e23 o: 40000e23 o: 40000e23 o: 40000e23
$21 = void
(gdb) p __msan_print_shadow(&ptr,8)
ff ff ff ff ff ff ff ff
o: 40000e23 o: 80007614
$22 = void

Notice the origin of the lower bits is off. Any idea as to why?

Looks like when you materialize the stores, you should check the size of the the store and emit an appropriate amount of stores to the origin shadow (or just a memset intrinsic?).

Looks like when you materialize the stores, you should check the size of the
the store and emit an appropriate amount of stores to the origin shadow (or
just a memset intrinsic?).

Yes. There is an old bug for that on the code.google.com site, but it
has _never_ caused problems. I guess partial reads at a non-zero
offset are pretty uncommon, and also LLVM IR tends to use smaller
stores (larger ones are transformed to memset/memcpy and we handle
those correctly). Should be fixed anyway.

Note that __msan_print_shadow can lie about origin a bit if the
address is not 4-byte aligned (it would print origins for aligned
locations instead).

Yes, it would be great to get that fixed.

I'm about to have another pass at this, so was wondering if this was
ever fixed (the reason I'm taking another pass is that I managed to
find the MCJIT issue that was causing this to take forever).

__msan_print_shadow has been fixed and improved a lot.

Origins for 8-byte stores are not fixed yet.

Btw, try -fsanitize-memory-track-origins=2.