Address sanitizer regression test failures for PPC64 targets

Hi all,

I have been seeing failures in the address sanitizer regression tests for a PPC64 target (Power7 machine). My understanding is that most of the failures are related to the fact that the stack is not being dumped.

I tried to understand what might be wrong and started by looking into the null_deref.cc test, as it hangs during the test run. I observe that, after the faulty memory access is detected, the process receives a SEGV after entering ReportSIGSEGV(), more precisely when it gets to __intercept_strlen() and tries to access flags()->replace_str. The caller of __intercept_strlen() is get_cie_encoding() from libgcc (version 4.8.2 on my system).

As I am not familiar with the sanitizer implementation, I was wondering if this is an expected failure for PPC targets due to some incomplete implementation, an unexpected bug, or due to some misconfiguration in the Clang/LLVM build for PPC targets.

Has anyone experienced a similar issue?

Thanks in advance!
Samuel

+Bill Schmidt

Sanitizer used to work on PPC at some point, but currently it fails on most
of the tests from the "check-asan" test suite on the PowerPC buildbot
(http://lab.llvm.org:8011/builders/sanitizer-ppc64-linux1).
I can't really diagnose the issue from your description. flags() is just a
pointer to a global variable, so I don't see why access to flags()->replace_str
will segfault.

Note that I've set the SA_NODEFER flag for the SEGV handler in the
ASan runtime only a couple of days ago.
Not sure that could've affected this test though; without that flag
the second SEGV would've simply crashed the program. But you can try
removing the flag from
compiler-rt/trunk/lib/sanitizer_common/sanitizer_posix_libcdep.cc and
see if that makes any difference.
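
For readers unfamiliar with the flag, here is a minimal sketch of installing a SIGSEGV handler with and without SA_NODEFER through POSIX sigaction(). The names OnSegv, InstallSegvHandler and CurrentSegvFlags are hypothetical and are not taken from the ASan runtime; the handler body is a placeholder.

```cpp
#include <signal.h>
#include <unistd.h>

// Hypothetical handler: a real runtime would print its report here.
static void OnSegv(int, siginfo_t *, void *) {
  _exit(1);
}

// Installs OnSegv for SIGSEGV. With SA_NODEFER the signal is not blocked
// while the handler runs, so a fault inside the handler re-enters it
// instead of being delivered while blocked.
bool InstallSegvHandler(bool allow_nested) {
  struct sigaction act = {};
  act.sa_sigaction = OnSegv;
  act.sa_flags = SA_SIGINFO;
  if (allow_nested) act.sa_flags |= SA_NODEFER;
  sigemptyset(&act.sa_mask);
  return sigaction(SIGSEGV, &act, nullptr) == 0;
}

// Reads back the flags currently installed for SIGSEGV.
int CurrentSegvFlags() {
  struct sigaction cur = {};
  sigaction(SIGSEGV, nullptr, &cur);
  return cur.sa_flags;
}
```

Calling InstallSegvHandler(false) corresponds to the experiment suggested above: a second SEGV raised inside the handler is then no longer handled by it.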

HTH,
Alex

Alexey, Alexander,

Thanks for the suggestions. I tried removing the SA_NODEFER flag but it didn’t do any good… I have been digging into the problem with the null_deref test today but I was unable to clearly identify the problem. I suspect that it was a bug in the calling convention/unwinding that led to the flags() pointer getting corrupted. It is also possible that it was related to endianness issues caused by some bug in the pointer arithmetic inserted by the sanitizer code (there are many type and bit casts, which makes it hard to follow the references). I decided to upgrade the compiler I was using to build clang, which made the problem with this testcase go away (!).

Nevertheless, I still have problems in other testcases that may be related to the problem I was getting before. E.g., in new_array_cookie_test I am getting an infinite loop in the destructor of the array (delete operator). I noticed that the references passed to __asan_poison_cxx_array_cookie and __asan_load_cxx_array_cookie were pointing to values differing in the 4 most significant bytes, which made me suspect that the problem is related to endianness. I am reproducing part of the IR generated for this test:

  store i64 %0, i64* %9, align 8, !dbg !35, !nosanitize !2
  call void @__asan_poison_cxx_array_cookie(i64* %9), !dbg !35
  %10 = getelementptr inbounds i8* %call, i64 8, !dbg !35
  %11 = bitcast i8* %10 to %struct.C*, !dbg !35
  call void @llvm.dbg.value(metadata !{%struct.C* %11}, i64 0, metadata !23), !dbg !36
  %x = bitcast i8* %call to i32*, !dbg !37
  %12 = ptrtoint i32* %x to i64, !dbg !37
  %13 = lshr i64 %12, 3, !dbg !37
  %14 = add i64 %13, 2199023255552, !dbg !37
  %15 = inttoptr i64 %14 to i8*, !dbg !37
  %16 = load i8* %15, !dbg !37
  %17 = icmp ne i8 %16, 0, !dbg !37
  br i1 %17, label %18, label %24, !dbg !37, !prof !38

; <label>:18 ; preds = %entry
  %19 = and i64 %12, 7, !dbg !37
  %20 = add i64 %19, 3, !dbg !37
  %21 = trunc i64 %20 to i8, !dbg !37
  %22 = icmp sge i8 %21, %16, !dbg !37
  br i1 %22, label %23, label %24

; <label>:23 ; preds = %18
  call void @__asan_report_store4(i64 %12), !dbg !37
  call void asm sideeffect "", ""()
  unreachable

; <label>:24 ; preds = %18, %entry
  store i32 10, i32* %x, align 4, !dbg !37, !tbaa !39
  %25 = call i64 @__asan_load_cxx_array_cookie(i64* %9), !dbg !44

In this code, %9 and %x alias but have different types (i64* and i32*), so ‘store i32 10, i32* %x, align 4, !dbg !37, !tbaa !39’ produces different results on machines with different endianness. On a big-endian machine the value 10 is written to the 4 most-significant bytes of the memory referenced by %9.
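
As a reading aid for the instrumentation in the IR above (shift by 3, add 2199023255552, load a byte, compare), the check before the 4-byte store can be sketched in C++ roughly as follows. The function and constant names are hypothetical; the offset and the arithmetic are read off the IR, and the interpretation of the shadow byte (0 means all 8 bytes are addressable, a positive value k means only the first k are, a negative value means poisoned) is the usual ASan encoding.

```cpp
#include <cstdint>

// Shadow offset read off the IR: 2199023255552 == 1 << 41.
constexpr uint64_t kShadowOffset = 1ULL << 41;

// Address of the shadow byte describing the 8-byte granule that holds `addr`
// (mirrors the lshr/add pair in the IR).
constexpr uint64_t ShadowAddress(uint64_t addr) {
  return (addr >> 3) + kShadowOffset;
}

// Mirrors the two-step branch in the IR: a zero shadow byte takes the fast
// path; otherwise the offset of the last accessed byte within the granule,
// (addr & 7) + size - 1, is compared (signed) against the shadow byte.
constexpr bool IsBadAccess(uint64_t addr, unsigned size, int8_t shadow_value) {
  return shadow_value != 0 &&
         static_cast<int8_t>((addr & 7) + size - 1) >= shadow_value;
}
```

For the store in the IR, size is 4, which is where the constant 3 in ‘add i64 %19, 3’ comes from.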

As I mentioned before, I don’t know the sanitizer implementation well so it is possible I may be missing something. Can anyone shed some light on this?

Thanks again!
Samuel

Adding Kostya who authored the __asan_poison_cxx_array_cookie() stuff

Hi Samuel,
Which compiler versions were you using before and after? At the moment,
I'm building with a gcc 4.9 snapshot, but can switch to something newer
if you had a recommendation.

Thanks,
-Will

Hi Will,

Do the sanitizer tests work for you with gcc 4.9? I was using clang 3.4.2 and switched to clang 3.5.0. For both versions, I configured clang to use the gcc 4.8.2 toolchain.

My understanding is that the compiler was not the problem; rather, an endianness issue in the sanitizer implementation was causing memory to get corrupted. The different compiler versions just caused the memory corruption to affect different ranges, so some tests that were not working started working and vice versa.

I identified one of the places with the endianness issue in my previous email. I’m unsure whether there are other places in the code that only work for little-endian.

Thanks,
Samuel

Samuel,

Was this ever resolved?

-Hal

Hi Hal,

No, the issue is still unresolved. Alexander Potapenko added Kostya to the thread, who seems to have authored the code that is causing the failure on big-endian machines. I haven’t received any feedback so far. I’d be happy to help solve the issue, but I haven’t had the time to improve my understanding of the sanitizer. There are several places in the sanitizer that encode different information in memory through aliasing pointers, so it is not clear to me what impact fixing this particular endianness problem would have on other components of the sanitizer…

Thanks,
Samuel

[I am sorry, I've missed this thread. Don't hesitate to ping me if I don't
respond in 1-2 days. ]

This is a new test for new functionality, currently present in clang's
asan, not in GCC.
We never tried it on big-endian machines.

How does the test behave on PPC?

--kcc

Hi Kostya,

Thanks for looking into this! Currently the test starts calling the destructor ~C() over and over, as if it were in an infinite loop, and it does not return.

I’ll try to explain how I understand the problem. Just to make sure we are on the same page, I am attaching the instrumented IR that I obtain by running:

clang --driver-mode=g++ -fsanitize=address -mno-omit-leaf-frame-pointer -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -m64 -O3 /home/sfantao/llvm-trunk/llvm-svn.src/projects/compiler-rt/test/asan/TestCases/Linux/new_array_cookie_test.cc -o debug.ir -S -emit-llvm

In this code %x (i32*) and %9 (i64*) both alias %call. When 10 is stored to %x, the way this is reflected in a load from %9 (the ASAN calls use this pointer instead of %x) differs depending on the endianness. Assuming that %9 and %x are 0x00, the memory layout before and after the store in big-endian will be

Addr   Before   After
0x00   0xZZ     0x00
0x01   0xZZ     0x00
0x02   0xZZ     0x00
0x03   0xZZ     0x0A
0x04   0xZZ     0xZZ
0x05   0xZZ     0xZZ
0x06   0xZZ     0xZZ
0x07   0xZZ     0xZZ

When a load using %9 is done, I get 0x0000000AZZZZZZZZ. On a little-endian machine I would get 0xZZZZZZZZ0000000A instead, which is probably what you would expect. Then, when the destructor is called, whatever is decoding the size of ‘buffer’ loads the wrong information (possibly zero or a very large number, causing the infinite loop).
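
The effect can be reproduced independently of the host byte order with a small model. This is only a sketch: the names ByteOrder, StoreBytes, LoadBytes and CookieAfterStore are made up for illustration, and the initial cookie value used in the note below (1) is an assumed element count, not something taken from the test run.

```cpp
#include <array>
#include <cstdint>

enum class ByteOrder { Little, Big };

// Writes the low `n` bytes of `value` into `dst` using the given byte order.
void StoreBytes(uint8_t *dst, unsigned n, uint64_t value, ByteOrder order) {
  for (unsigned i = 0; i < n; ++i) {
    unsigned pos = (order == ByteOrder::Little) ? i : n - 1 - i;
    dst[pos] = static_cast<uint8_t>(value >> (8 * i));
  }
}

// Reads `n` bytes from `src` as an unsigned integer in the given byte order.
uint64_t LoadBytes(const uint8_t *src, unsigned n, ByteOrder order) {
  uint64_t value = 0;
  for (unsigned i = 0; i < n; ++i) {
    unsigned pos = (order == ByteOrder::Little) ? i : n - 1 - i;
    value |= static_cast<uint64_t>(src[pos]) << (8 * i);
  }
  return value;
}

// Models the sequence in the IR: an 8-byte store of `count` through %9,
// a 4-byte store of `stored` through %x (same address), then an 8-byte load.
uint64_t CookieAfterStore(uint64_t count, uint32_t stored, ByteOrder order) {
  std::array<uint8_t, 8> cookie{};
  StoreBytes(cookie.data(), 8, count, order);
  StoreBytes(cookie.data(), 4, stored, order);
  return LoadBytes(cookie.data(), 8, order);
}
```

With an assumed count of 1 and a stored value of 10, the little-endian model reads back 10, while the big-endian model reads back 0x0000000A00000001, i.e. the 10 lands in the upper half of the 64-bit value.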

Any hint on how to fix this? I understand some other information is being encoded in the pointers, so it is hard for me to understand whether fixing this for %x would have bad implications in other components of the sanitizer.

Let me know if you’d like me to provide more information.

Thanks again!
Samuel

debug.ir (17.1 KB)

Aha, I see the problem!

The test has an intentional bug that asan is supposed to find.
So, by default and with ASAN_OPTIONS=poison_array_cookie=1 the test should
never reach the point where it calls ~C.
Is this true for you on PPC?

But then, the test is also executed
with ASAN_OPTIONS=poison_array_cookie=0, i.e. the bug is not detected and
the execution goes further and tries to execute the DTOR.
And then yes, on a little-endian machine the DTOR will get executed 10
times, while on a big-endian one it will get executed a near-infinite
number of times.

Does r218841 help?

--kcc

Kostya,

Thanks! The patch seems to have fixed the problem with the poison array test! :)

I still have issues with some other testcases. E.g. stress_dtls gets stuck in an infinite loop, although it does not seem to be related to the previous issue. I’ll have to investigate a little more and will post my conclusions…

Thanks again,
Samuel