Meet at MTV to discuss TSAN/android/aarch64?

Hi everyone.

Thanks to everyone for taking the time to meet yesterday. I think it was very productive!

I am writing to keep a “minutes” of the meet, as I remember them.

Action Items:

  • Jasonk: split up bionic patch to explicitly replace CPP macros with actual calls, one file at a time.
  • Google/Android: pthread_barrier interface in bionic
  • Enh: 1 paragraph to explain “namespacing” in bionic?? (in case we need to reconsider macro replacement again?)

Some technical notes/explanations that possibly weren’t made clear at the meeting.

The current compller-rt patch (on phabricator) is already a pretty minimal set. The only real bit of actual functionality (other than the infrastructure bits to enable TSAN on android/aarch64) is the TLS workaround.

i.e. the major pieces for my patch to compiler-rt are:

  • infrastructure (new definitions, cmake changes etc…)
  • TLS workaround
  • pthread_barrier-like interface for tests
  • disabling of some tests on android-aarch64

My current hypothesis is that the 120 tests (out of 200 or so) that are failing are due the behavioral mismatch of my “quick hack” to enable barrier-like behavior on android. Assuming that Google’s implementation of the barrier interface on android/aarch64 is a better fit with linux-x86_64, this will hopefully result in more tests passing.

  • Very important info regarding why __bionic_XXX() replacement for calls to XXX() will most likely need to be “global” (within bionic) in most cases:

  • deadlocks, crashes in TSAN were all due to unexpected interceptions WHILE a call was already being intercepted by TSAN.

  • The most obvious TSAN-only workaround is to ignore interceptions when this is taking place. This works for two threads, but does not scale to more threads without the danger of TSAN missing events.

  • Any unexpected nested call chains in bionic that TSAN currently intercepts will either need custom coding in TSAN to explicitly handle (for all possible cases in which such call chains can occur), OR replace those calls in bionic with non-interceptible ones. The former will obviously add much more complexity to TSAN (to wit, additional testing failure vulnerability), while the latter will mean more (mechanical) changes to BIONIC. Currently, my thinking is that the latter, (being mechanical in nature) is preferable.

  • To ENH: The same holds for calls to tcgetattr() → ioctl(), and isatty() → tcgetattr() → ioctl() call chains. TSAN intercepts tcgetattr() and ioctl(), and is not expecting recursive interceptions during them. Its not at all clear how to handle this within TSAN, especially among multiple threads. Even if a logic can be worked out, it’s fundamentally much clearer to simply disallow such interceptions to occur.

Things that might require future discussion:

  • need a build-bot/test bot on labs.llvm.org for android-bionic-aarch64-tsan tests to prevent insanity inducing regressions :slight_smile:

Thanks for reading!

-Jason

Hi everyone.

Thanks to everyone for taking the time to meet yesterday. I think it was
very productive!

I am writing to keep a "minutes" of the meet, as I remember them.

Action Items:

Jasonk: split up bionic patch to explicitly replace CPP macros with actual
calls, one file at a time.
Google/Android: pthread_barrier interface in bionic
Enh: 1 paragraph to explain "namespacing" in bionic?? (in case we need to
reconsider macro replacement again?)

grep for "namespace.h" and "un-namespace.h". interestingly, freebsd
and netbsd have this but openbsd doesn't.

Some technical notes/explanations that possibly weren't made clear at the
meeting.

The current compller-rt patch (on phabricator) is already a pretty minimal
set. The only real bit of actual functionality (other than the
infrastructure bits to enable TSAN on android/aarch64) is the TLS
workaround.

i.e. the major pieces for my patch to compiler-rt are:

infrastructure (new definitions, cmake changes etc..)
TLS workaround
pthread_barrier-like interface for tests
disabling of some tests on android-aarch64

My current hypothesis is that the 120 tests (out of 200 or so) that are
failing are due the behavioral mismatch of my "quick hack" to enable
barrier-like behavior on android. Assuming that Google's implementation of
the barrier interface on android/aarch64 is a better fit with linux-x86_64,
this will hopefully result in more tests passing.

Very important info regarding why __bionic_XXX() replacement for calls to
XXX() will most likely need to be "global" (within bionic) in most cases:

deadlocks, crashes in TSAN were all due to unexpected interceptions WHILE a
call was already being intercepted by TSAN.
The most obvious TSAN-only workaround is to ignore interceptions when this
is taking place. This works for two threads, but does not scale to more
threads without the danger of TSAN missing events.
Any unexpected nested call chains in bionic that TSAN currently intercepts
will either need custom coding in TSAN to explicitly handle (for all
possible cases in which such call chains can occur), OR replace those calls
in bionic with non-interceptible ones. The former will obviously add much
more complexity to TSAN (to wit, additional testing failure vulnerability),
while the latter will mean more (mechanical) changes to BIONIC. Currently,
my thinking is that the latter, (being mechanical in nature) is preferable.
To ENH: The same holds for calls to tcgetattr() --> ioctl(), and isatty() ->
tcgetattr() --> ioctl() call chains. TSAN intercepts tcgetattr() and
ioctl(), and is not expecting recursive interceptions during them.

what does this mean?

external/compiler-rt/lib/sanitizer_common/sanitizer_platform_interceptors.h:144:#define
SANITIZER_INTERCEPT_TCGETATTR SI_LINUX_NOT_ANDROID

This means that tcgetattr is not intercepted on Android. Either it was
not there before, or I looked at the wrong place.

Hi Jason,

I have a few additional points:

1. It works on glibc but not in bionic because glibc calls internal
allocation functions while bionic calls malloc/free, which get
instrumented when the thread is being destroyed. Calling bionic_malloc
on those cases should be enough to remove the problem altogether, but
we should not call bionic_malloc everywhere, since we want all other
calls to be instrumented and we don't want user facing behaviour to
change.

2. The changes to Android should be made in the implementation, not in
the header files. We're trying to avoid include nightmares and
pre-processor race conditions, as well as protecting users from
implementation details.

3. Google will prioritise pthread_barrier and other functionality that
helps TSAN works with bionic without the need for the TLS workaround
in compiler-rt. They'll also set up some internal bots to make sure we
test those things from now on.

cheers,
--renato

This means that tcgetattr is not intercepted on Android. Either it was
not there before, or I looked at the wrong place.

Yeah, I guess the intent is that tcgetattr should not be intercepted on
android . I don't think it was enough though, even if it worked as intended
-- At least on a fairly recent build for android, the chain isatty() ->
tcgettatr() (and then finally ioctl()) were being caught be TSAN
recursively, and were problematic. It seems simpler to not have those calls
caught by TSAN when theya are being made by bionic...

Thanks for the clarification, Evgenii.
-Jason

Hi Jason,

I have a few additional points:

1. It works on glibc but not in bionic because glibc calls internal
allocation functions while bionic calls malloc/free, which get
instrumented when the thread is being destroyed. Calling bionic_malloc
on those cases should be enough to remove the problem altogether, but
we should not call bionic_malloc everywhere, since we want all other
calls to be instrumented and we don't want user facing behaviour to
change.

The calls that are changed would be purely bionic-to-bionic calls of XXXX()
(XXXX() being malloc() or whatever else). There would be no user noticeable
diffferences. If XXXX() is called by a user routine, it would still be
intercepted.

2. The changes to Android should be made in the implementation, not in
the header files. We're trying to avoid include nightmares and
pre-processor race conditions, as well as protecting users from
implementation details.

??? There are no nightmares here. Of what do you speak?

3. Google will prioritise pthread_barrier and other functionality that
helps TSAN works with bionic without the need for the TLS workaround
in compiler-rt. They'll also set up some internal bots to make sure we
test those things from now on.

Sure. That sounds fine.