Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)

Just moving this branch of the thread out of the review because I don't
want to derail the review thread...

Kostya - why are these two cases not optimization bugs in general? (why do
they only affect sanitizers?)

Just moving this branch of the thread out of the review because I don't
want to derail the review thread...

Kostya - why are these two cases not optimization bugs in general? (why do
they only affect sanitizers?)

The recent case from mozilla (
https://code.google.com/p/thread-sanitizer/issues/detail?id=40#c2) is a
legal
optimization -- it hoists a safe load (i.e. a load which is known not to
fail) out of conditional branch.
It reduces the number of basic blocks and branches, and so I think it's
good in general.
I can't imagine a case where this optimization will break a valid program.
Which is the second one you are referring to?

--kcc

Just moving this branch of the thread out of the review because I don't
want to derail the review thread...

Kostya - why are these two cases not optimization bugs in general? (why
do they only affect sanitizers?)

The recent case from mozilla (
https://code.google.com/p/thread-sanitizer/issues/detail?id=40#c2) is a
legal
optimization -- it hoists a safe load (i.e. a load which is known not to
fail) out of conditional branch.

Hmm, OK. I can see how the optimization is reasonable. I'll have to read
up/chat more about how TSan works to understand how it gets caught up by
this.

It reduces the number of basic blocks and branches, and so I think it's
good in general.
I can't imagine a case where this optimization will break a valid program.
Which is the second one you are referring to?

The load widening issue you mentioned in another reply on the same thread.

The root cause of those issues is the fact that sanitizers verify
C++-level semantics with LLVM IR level instrumentation. For example,
speculative loads are OK in IR if it can be proved that the load won't
trap, but in C++ it would be a data race.

My $0.02 - I'm not sure the transformation introduces a data race.

To the best of my understanding, the point of the C++11/C11 memory model is to allow a wide array of compiler transformations - including speculative loads - for non-atomic variables.
I believe what's most likely happening (without looking at the Mozilla source) is that the original program contains a C++ data race, and the transformation exposes it to TSan.

From: "Evgeniy Stepanov" <eugeni.stepanov@gmail.com>
To: "Kostya Serebryany" <kcc@google.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Tuesday, November 19, 2013 10:55:11 AM
Subject: Re: [LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with
MemorySanitizer)

The root cause of those issues is the fact that sanitizers verify
C++-level semantics with LLVM IR level instrumentation. For example,
speculative loads are OK in IR if it can be proved that the load
won't
trap, but in C++ it would be a data race.

So you're saying that *if* the load had been unconditional in the original C++ program, then it would have been a data race? That does not sound right: the data race is not a property of the load itself, it is a property of the use of that data.

-Hal

My $0.02 - I'm not sure the transformation introduces a data race.

To the best of my understanding, the point of the C++11/C11 memory model
is to allow a wide array of compiler transformations - including
speculative loads - for non-atomic variables.
I believe what's most likely happening (without looking at the Mozilla
source) is that the original program contains a C++ data race, and the
transformation exposes it to TSan.

The original program is race-free.
I've posted a minimized reproducer that actually triggers a tsan false
positive at O1 here:
https://code.google.com/p/thread-sanitizer/issues/detail?id=40#c5

Just moving this branch of the thread out of the review because I don't
want to derail the review thread...

Kostya - why are these two cases not optimization bugs in general? (why
do they only affect sanitizers?)

The recent case from mozilla (
https://code.google.com/p/thread-sanitizer/issues/detail?id=40#c2) is a
legal
optimization -- it hoists a safe load (i.e. a load which is known not to
fail) out of conditional branch.

Hmm, OK. I can see how the optimization is reasonable. I'll have to read
up/chat more about how TSan works to understand how it gets caught up by
this.

It reduces the number of basic blocks and branches, and so I think it's
good in general.
I can't imagine a case where this optimization will break a valid
program.
Which is the second one you are referring to?

The load widening issue you mentioned in another reply on the same thread.

That was https://code.google.com/p/address-sanitizer/issues/detail?id=20
(asan)
If you have e.g. "char a[3]" aligned by 4 and you load a[0] and a[2], it
is tempting to load 4 bytes from (int*)a instead.
This is legal from LLVM point of view, but it makes asan think that we
overflow 'a' by one byte to the right.

In tsan it's even worse.
Consider we have "char a[4]" aligned by 4, and we load a[1] and a[3].
Then the optimizer replaces it with one load from (int*)a.
But another thread is reading a[2]. There was no race on C++ level, but we
now have a race on LLVM level.

Note that valgrind suffers from all the same issues which makes it nearly
unusable at large scale with anything other than -O0.

--kcc

What I’m trying to say is that according to my understanding of the C++11 memory model, even in that small reproducer, the store to g and the load from g are in fact a data race.

(This is regardless of the fact the load is protected by a branch that is not taken.)

What I’m trying to say is that according to my understanding of the
C++11 memory model, even in that small reproducer, the store to g and the
load from g are in fact a data race.

(This is regardless of the fact the load is protected by a branch that is
not taken.)

My understanding of the standard is quite the opposite.

From: "Michael M Kuperstein" <michael.m.kuperstein@intel.com>
To: "Kostya Serebryany" <kcc@google.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Tuesday, November 19, 2013 11:56:44 AM
Subject: Re: [LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with
MemorySanitizer)

What I’m trying to say is that according to my understanding of the
C++11 memory model, even in that small reproducer, the store to g
and the load from g are in fact a data race.

(This is regardless of the fact the load is protected by a branch
that is not taken.)

Can you please explain this more? I don't think this makes sense.

-Hal

From: "Kostya Serebryany" <kcc@google.com>
To: "Michael M Kuperstein" <michael.m.kuperstein@intel.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Tuesday, November 19, 2013 11:45:39 AM
Subject: Re: [LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with
MemorySanitizer)

My $0.02 - I'm not sure the transformation introduces a data race.

To the best of my understanding, the point of the C++11/C11 memory
model is to allow a wide array of compiler transformations -
including speculative loads - for non-atomic variables.
I believe what's most likely happening (without looking at the
Mozilla source) is that the original program contains a C++ data
race, and the transformation exposes it to TSan.

The original program is race-free.
I've posted a minimized reproducer that actually triggers a tsan
false positive at O1 here:
https://code.google.com/p/thread-sanitizer/issues/detail?id=40#c5

Here's my problem:

int g;
int foo(int cond) {
  if (cond)
    return g;
  return 0;
}

is being transformed into this (at the IR level):

int g;
int foo(int cond) {
  return cond ? g : 0;
}

but I could have just as easily made that transformation at the C++ level as well, and I don't see that as introducing a data race here (in practice). When you have a load that is used only conditionally, TSAN should only record it when the condition is true. IMHO, to do otherwise, is a bug in TSAN. Furthermore, if you try and do this by disabling all optimizations that transform conditionally-executed blocks into selects, you'll have a lot more work to do (because you'll also need to handle every place in the backend that does if conversion).

-Hal

> From: "Kostya Serebryany" <kcc@google.com>
> To: "Michael M Kuperstein" <michael.m.kuperstein@intel.com>
> Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
> Sent: Tuesday, November 19, 2013 11:45:39 AM
> Subject: Re: [LLVMdev] Curiosity about transform changes under
Sanitizers (Was: [PATCH] Disable branch folding with
> MemorySanitizer)
>
>
>
>
>
>
>
>
>
>
> My $0.02 - I'm not sure the transformation introduces a data race.
>
> To the best of my understanding, the point of the C++11/C11 memory
> model is to allow a wide array of compiler transformations -
> including speculative loads - for non-atomic variables.
> I believe what's most likely happening (without looking at the
> Mozilla source) is that the original program contains a C++ data
> race, and the transformation exposes it to TSan.
>
>
>
> The original program is race-free.
> I've posted a minimized reproducer that actually triggers a tsan
> false positive at O1 here:
> https://code.google.com/p/thread-sanitizer/issues/detail?id=40#c5
>
>

Here's my problem:

int g;
int foo(int cond) {
  if (cond)
    return g;
  return 0;
}

is being transformed into this (at the IR level):

int g;
int foo(int cond) {
  return cond ? g : 0;

This doesn't have the same semantics as the IR transformed 'select',
though, does it?

In IR 'g' is unconditionally loaded before the select. In C++, 'g' is
completely unevaluated if 'cond' is false.

I agree. It is a potential data race, if the branch is taken then it definitely is a race because the standard explicitly does not define an ordering on loads and stores of non-atomic variables unless some happens-before relationship is established by some other operation (which does not occur in this example). In the case where the branch is not taken, then there is no data race because in the C++ abstract machine there is no load, whether there is one in the underlying concrete machine or not.

I believe that this transform is valid, because there are only two possible cases here:

- Some other code has established a happens-before relationship between the load and the stores, giving a well-defined ordering, however this code is not visible in the function foo() and so the load may happen anywhere.

- No other code establishes a happens-before relationship, and therefore the order of the load and the store is completely undefined and as long as the load doesn't trap it is completely safe to hoist it.

HOWEVER, I would dispute describing this transform as an optimisation in this case. If the two threads are running on different cores, then we have likely introduced some false sharing and have forced the two cores to exchange some bus traffic for cache coherency, even though is is completely valid in the C++ abstract machine model for the load of g to return a stale cache value. This is only a real problem on strongly ordered architectures (e.g. x86), but given the relative cost of a cache shootdown and everything else in this test case (with the exception of the thread creation), I wouldn't be surprised if it ended up slowing things down. Especially given that a modern superscalar CPU will speculatively execute the load ANYWAY if it can do so from cache, and if it can't then the performance improvement from doing it before the branch will likely be negligible.

For single-core, in-order, single-issue architectures, or multicore, weakly ordered, in-order, single-issue architectures, it's probably a win...

David

Quoting the C11 standard, section 5.1.2.4:

(4) “Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location”

(25) “The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior”

(18) “An evaluation A happens before an evaluation B if A is sequenced before B or A inter-thread happens before B”

The load and the store are in conflict (one is a load and the other is a store to the same location) and neither is atomic. So the question is whether they happen-before each other.

Since they are not in the same thread, they are not sequenced before one another, so the only option is inter-thread happens before.

I can’t see an inter-thread happens-before relation here.

From: "David Blaikie" <dblaikie@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Kostya Serebryany" <kcc@google.com>, "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Tuesday, November 19, 2013 12:19:46 PM
Subject: Re: [LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with
MemorySanitizer)

> From: "Kostya Serebryany" < kcc@google.com >
> To: "Michael M Kuperstein" < michael.m.kuperstein@intel.com >
> Cc: "LLVM Developers Mailing List" < llvmdev@cs.uiuc.edu >
> Sent: Tuesday, November 19, 2013 11:45:39 AM
> Subject: Re: [LLVMdev] Curiosity about transform changes under
> Sanitizers (Was: [PATCH] Disable branch folding with
> MemorySanitizer)
>
>
>
>
>
>
>
>

>
>
> My $0.02 - I'm not sure the transformation introduces a data race.
>
> To the best of my understanding, the point of the C++11/C11 memory
> model is to allow a wide array of compiler transformations -
> including speculative loads - for non-atomic variables.
> I believe what's most likely happening (without looking at the
> Mozilla source) is that the original program contains a C++ data
> race, and the transformation exposes it to TSan.
>
>
>
> The original program is race-free.
> I've posted a minimized reproducer that actually triggers a tsan
> false positive at O1 here:
> https://code.google.com/p/thread-sanitizer/issues/detail?id=40#c5
>
>

Here's my problem:

int g;
int foo(int cond) {
if (cond)
return g;
return 0;
}

is being transformed into this (at the IR level):

int g;
int foo(int cond) {
return cond ? g : 0;

This doesn't have the same semantics as the IR transformed 'select',
though, does it?

In IR 'g' is unconditionally loaded before the select. In C++, 'g' is
completely unevaluated if 'cond' is false.

This is a good point. It is more like:
  int i = g;
  return cond ? i : 0;

-Hal

From: "David Blaikie" <dblaikie@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Kostya Serebryany" <kcc@google.com>, "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Tuesday, November 19, 2013 12:19:46 PM
Subject: Re: [LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with
MemorySanitizer)

> From: "Kostya Serebryany" < kcc@google.com >
> To: "Michael M Kuperstein" < michael.m.kuperstein@intel.com >
> Cc: "LLVM Developers Mailing List" < llvmdev@cs.uiuc.edu >
> Sent: Tuesday, November 19, 2013 11:45:39 AM
> Subject: Re: [LLVMdev] Curiosity about transform changes under
> Sanitizers (Was: [PATCH] Disable branch folding with
> MemorySanitizer)
>
>
>
>
>
>
>
>

>
>
> My $0.02 - I'm not sure the transformation introduces a data race.
>
> To the best of my understanding, the point of the C++11/C11 memory
> model is to allow a wide array of compiler transformations -
> including speculative loads - for non-atomic variables.
> I believe what's most likely happening (without looking at the
> Mozilla source) is that the original program contains a C++ data
> race, and the transformation exposes it to TSan.
>
>
>
> The original program is race-free.
> I've posted a minimized reproducer that actually triggers a tsan
> false positive at O1 here:
> https://code.google.com/p/thread-sanitizer/issues/detail?id=40#c5
>
>

Here's my problem:

int g;
int foo(int cond) {
if (cond)
return g;
return 0;
}

is being transformed into this (at the IR level):

int g;
int foo(int cond) {
return cond ? g : 0;

This doesn't have the same semantics as the IR transformed 'select',
though, does it?

In IR 'g' is unconditionally loaded before the select. In C++, 'g' is
completely unevaluated if 'cond' is false.

This is a good point. It is more like:
  int i = g;
  return cond ? i : 0;

And that becomes a data race.

-Hal

}

but I could have just as easily made that transformation at the C++
level as well, and I don't see that as introducing a data race here
(in practice). When you have a load that is used only conditionally,
TSAN should only record it when the condition is true. IMHO, to do
otherwise, is a bug in TSAN. Furthermore, if you try and do this by
disabling all optimizations that transform conditionally-executed
blocks into selects, you'll have a lot more work to do (because
you'll also need to handle every place in the backend that does if
conversion).

What you suggest is effectively reverting unwanted transformations in
TSan instrumentation pass, instead of disabling them. This is not
always possible, because some of them lose information about the
intent of the original code. As in your example, load as part of a ?:
subexpression that is not evaluated is ok; hoisting that load into a
separate statement creates a data race; both are presented to TSan as
identical IR. Similar examples could be made for load widening
(harmful for both ASan and TSan), etc.

Looks like the only _really_ correct way of doing it is with some kind
of help from the frontend. It may also improve speed by excluding
checks that don't make sense or are trivially false in the context of
the source language.

Sorry, you're right, I've read the spec more carefully, and there is no evaluation in the original code.

In fact, C11 addresses this case specifically:
Note 14 to 5.1.24:
"Transformations that introduce a speculative read of a potentially shared memory location may not preserve the semantics of the program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection."

So, I guess, under C/C++ semantics, the transformation is "illegal unless you know what you're doing for the specific target", and having TSan would make it illegal.
Or does that interpretation sound wrong too? :slight_smile:

>> What I’m trying to say is that according to my understanding of the
C++11 memory model, even in that small reproducer, the store to g and the
load from g are in fact a data race.
>>
>> (This is regardless of the fact the load is protected by a branch that
is not taken.)
>
> My understanding of the standard is quite the opposite.

I agree. It is a potential data race, if the branch is taken then it
definitely is a race because the standard explicitly does not define an
ordering on loads and stores of non-atomic variables unless some
happens-before relationship is established by some other operation (which
does not occur in this example). In the case where the branch is not
taken, then there is no data race because in the C++ abstract machine there
is no load, whether there is one in the underlying concrete machine or not.

I believe that this transform is valid, because there are only two
possible cases here:

- Some other code has established a happens-before relationship between
the load and the stores, giving a well-defined ordering, however this code
is not visible in the function foo() and so the load may happen anywhere.

- No other code establishes a happens-before relationship, and therefore
the order of the load and the store is completely undefined and as long as
the load doesn't trap it is completely safe to hoist it.

HOWEVER, I would dispute describing this transform as an optimisation in
this case.

You have a good point: evaluating such transformation on single-threaded
benchmarks (e.g. SPEC) makes no sense
if we care about multi-threaded code. And evaluating performance of
anything like this on a multi-threaded code is hard too.
And it's very easy to prepare a synthetic test case where this optimization
will cause 4x slowdown (see below).
But if we disable this transformation someone surely will come and complain
:slight_smile:
Another data point: gcc 4.6.3 does this too.

==> if-bench.cc <==
#include <pthread.h>
#include <unistd.h>

extern int foo(int);
extern void bar(int);

#ifndef N
#define N (1UL << 30)
#endif

void *Thread(void *p) {
  for (long i = 0; i < N; i++)
     foo(0);
  return 0;
}

int main() {
  pthread_t t[3];
  pthread_create(&t[0], 0, Thread, 0);
  pthread_create(&t[1], 0, Thread, 0);
  pthread_create(&t[2], 0, Thread, 0);
  for (long i = 0; i < N; i++)
    bar(i);
  pthread_join(t[0], 0);
  pthread_join(t[1], 0);
  pthread_join(t[2], 0);
}

==> g.cc <==
int g;

int foo(int cond) {
  if (cond)
    return g;
  // If we replace 'g' with g*g*g*g, the benchmark becomes 4x faster.
  return 0;
}

void bar(int i) { g = i; }

% clang if-bench.cc g.cc -lpthread -O1 && time ./a.out

--kcc

FTR: http://llvm-reviews.chandlerc.com/D2227 sent for review to fix the tsan-hostile load speculation.

–kcc