Possible soundness issue with available_externally (split from "RFC: Add guard intrinsics")

Hi all,

This is something that came up in the "RFC: Add guard intrinsics to
LLVM" thread; and while I'm not exactly blocked on this, figuring out
a path forward here will be helpful in deciding if we can use the
available_externally linkage type to express certain semantic
properties that guard intrinsics will have.

Let's start with an example that shows that we have a problem (direct
copy/paste from the guard intrinsics thread). Say we have:

void foo() available_externally {
  %t0 = load atomic %ptr
  %t1 = load atomic %ptr
  if (%t0 != %t1) print("X");
}
void main() {
  foo();
  print("Y");
}
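
(For concreteness, here is a rough C++ analogue of that IR -- the names
are illustrative, and I'm assuming relaxed atomic loads plus a second
thread that stores to the same location:)

  #include <atomic>
  #include <cstdio>

  std::atomic<int> g;     // assume another thread stores to g concurrently

  inline void foo() {     // stands in for the available_externally @foo
    int t0 = g.load(std::memory_order_relaxed);
    int t1 = g.load(std::memory_order_relaxed);
    if (t0 != t1) std::printf("X");
  }

  int main() {
    foo();
    std::printf("Y");
  }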

The possible behaviors of the above program are {print("X"),
print("Y")} or {print("Y")}. But if we run opt then we have

void foo() available_externally readnone nounwind {
  ;; After CSE'ing the two loads and folding the condition
}
void main() {
  foo();
  print("Y");
}

and some generic reordering

void foo() available_externally readnone nounwind {
  ;; After CSE'ing the two loads and folding the condition
}
void main() {
  print("Y");
  foo();  // legal since we're moving a readnone nounwind function that
          // was guaranteed to execute (hence can't have UB)
}

If we do not inline @foo(), and instead re-link the call site in @main
to some non-optimized copy (or differently optimized copy) of @foo,
then it is possible for the program to have the behavior {print("Y");
print ("X")}, which was disallowed in the earlier program.

In other words, opt refined the semantics of @foo() (i.e. reduced the
set of behaviors it may have) in ways that would make later
optimizations invalid if we de-refine the implementation of @foo().

The above example is clearly fabricated, but such cases can come up
even if everything is optimized to the same level. E.g. one of the
atomic loads in the unrefined implementation of @foo() could have been
hidden behind a function call, whose body existed in only one module.
That module would then be able to refine @foo() to `ret void` but
other modules won't.
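
Concretely, a hypothetical two-module setup (names are purely
illustrative) could look like:

  // shared.h, included by both modules:
  #include <atomic>
  void check(std::atomic<int> *p);        // body visible only in module A
  inline void foo(std::atomic<int> *p) { check(p); }

  // Module A defines check() as the two atomic loads plus the compare
  // and print; its optimizer can CSE the loads, fold check() away, and
  // refine its copy of foo() down to an empty function.  Module B only
  // sees the declaration of check(), so its copy of foo() keeps the
  // call and can still print "X".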

The only solution I can think of is to redefine available_externally
to mean "the only kind of IPO/IPA you can do over a call to this
function is to inline it". Redefining available_externally this way
will also let us soundly use it to represent calls to functions that
have guard intrinsics, since a failed guard intrinsic basically
replaces the function with a "very de-refined" implementation (the
interpreter).

What do you think? I don't think implementing the above will be
very difficult, but needless to say, it will still be a fairly
non-trivial semantic change (hence I'm not directly jumping to
implementation).

-- Sanjoy

Hi all,

This is something that came up in the "RFC: Add guard intrinsics to
LLVM" thread; and while I'm not exactly blocked on this, figuring out
a path forward here will be helpful in deciding if we can use the
available_externally linkage type to express certain semantic
properties that guard intrinsics will have.

Let's start with an example that shows that we have a problem (direct
copy/paste from the guard intrinsics thread). Say we have:

void foo() available_externally {
 %t0 = load atomic %ptr
 %t1 = load atomic %ptr
 if (%t0 != %t1) print("X");
}
void main() {
 foo();
 print("Y");
}

The possible behaviors of the above program are {print("X"),
print("Y")} or {print("Y")}. But if we run opt then we have

void foo() available_externally readnone nounwind {
 ;; After CSE'ing the two loads and folding the condition
}
void main() {
 foo();
 print("Y");
}

and some generic reordering

void foo() available_externally readnone nounwind {
 ;; After CSE'ing the two loads and folding the condition
}
void main() {
 print("Y");
 foo();  // legal since we're moving a readnone nounwind function that
         // was guaranteed to execute (hence can't have UB)
}

If we do not inline @foo(), and instead re-link the call site in @main
to some non-optimized copy (or differently optimized copy) of @foo,
then it is possible for the program to have the behavior {print("Y");
print ("X")}, which was disallowed in the earlier program.

In other words, opt refined the semantics of @foo() (i.e. reduced the
set of behaviors it may have) in ways that would make later
optimizations invalid if we de-refine the implementation of @foo().

I'm probably missing something obvious here. How could the result of
`%t0 != %t1` be different at optimization time in one file than from
runtime in the "real" implementation? Doesn't this make the CSE
invalid?

Does linkonce_odr linkage have the same problem?
- If so, do you want to change it too?
- Else, why not?

The above example is clearly fabricated, but such cases can come up
even if everything is optimized to the same level. E.g. one of the
atomic loads in the unrefined implementation of @foo() could have been
hidden behind a function call, whose body existed in only one module.
That module would then be able to refine @foo() to `ret void` but
other modules won't.

The only solution I can think of is to redefine available_externally
to mean "the only kind of IPO/IPA you can do over a call to this
function is to inline it". Redefining available_externally this way
will also let us soundly use it to represent calls to functions that
have guard intrinsics, since a failed guard intrinsic basically
replaces the function with a "very de-refined" implementation (the
interpreter).

What do you think? I don't think implementing the above will be
very difficult, but needless to say, it will still be a fairly
non-trivial semantic change (hence I'm not directly jumping to
implementation).

This linkage is used in three places (that I know of) by clang:

  1. C-style `inline` functions.
  2. Functions defined in C++ template classes with external explicit
     instantiations, e.g. S::foo() in:

         template <class T> struct S { void foo() {} };
         void bar() { S<int>().foo(); }
         extern template struct S<int>;

  3. -flto=thin cross-module function importing.

(No comment on (1); its exact semantics are a little fuzzy to me.)
For (2) and (3), the current behaviour seems correct, and I'd be
hesitant to lose optimizing power. (2) is under the "ODR" rule, and
I think we've been applying the same logic to (3). Unless, are you
saying ODR isn't enough?

Assuming you need this new definition (but under ODR, the semantics
are correct), I would rather split the linkage than change it. E.g.,
use a name like available_externally_odr for (2) and (3).

If we do not inline @foo(), and instead re-link the call site in @main
to some non-optimized copy (or differently optimized copy) of @foo,
then it is possible for the program to have the behavior {print("Y");
print ("X")}, which was disallowed in the earlier program.

In other words, opt refined the semantics of @foo() (i.e. reduced the
set of behaviors it may have) in ways that would make later
optimizations invalid if we de-refine the implementation of @foo().

I'm probably missing something obvious here. How could the result of
`%t0 != %t1` be different at optimization time in one file than from
runtime in the "real" implementation? Doesn't this make the CSE
invalid?

`%t0` and `%t1` are "allowed" to "always be the same", i.e. an
implementation of @foo that always feeds in the same
value for `%t0` and `%t1` is a valid implementation (which is why the
CSE was valid); but it is not the *only* valid implementation. If I
don't CSE the two load instructions (also a valid thing to do), and
this is a second thread writing to `%par`, then the two values loaded
can be different, and you could end up printing `"X"` in `@foo`.

Did that make sense?

Does linkonce_odr linkage have the same problem?
- If so, do you want to change it too?
- Else, why not?

Going by the specification in the LangRef, I'd say it depends on how
you define "definitive". If you're allowed to replace the body of a
function with a differently optimized body, then the above problem
exists.

The above example is clearly fabricated, but such cases can come up
even if everything is optimized to the same level. E.g. one of the
atomic loads in the unrefined implementation of @foo() could have been
hidden behind a function call, whose body existed in only one module.
That module would then be able to refine @foo() to `ret void` but
other modules won't.

The only solution I can think of is to redefine available_externally
to mean "the only kind of IPO/IPA you can do over a call to this
function is to inline it". Redefining available_externally this way
will also let us soundly use it to represent calls to functions that
have guard intrinsics, since a failed guard intrinsic basically
replaces the function with a "very de-refined" implementation (the
interpreter).

What do you think? I don't think implementing the above will be
very difficult, but needless to say, it will still be a fairly
non-trivial semantic change (hence I'm not directly jumping to
implementation).

This linkage is used in three places (that I know of) by clang:

  1. C-style `inline` functions.
  2. Functions defined in C++ template classes with external explicit
     instantiations, e.g. S::foo() in:

         template <class T> struct S { void foo() {} };
         void bar() { S<int>().foo(); }
         extern template struct S<int>;

  3. -flto=thin cross-module function importing.

(No comment on (1); its exact semantics are a little fuzzy to me.)
For (2) and (3), the current behaviour seems correct, and I'd be
hesitant to lose optimizing power. (2) is under the "ODR" rule, and
I think we've been applying the same logic to (3). Unless, are you
saying ODR isn't enough?

By ODR, do you mean you only have one definition of the function in
the whole link (i.e. across all modules you'll link together)?
Then yes, ODR should be enough to avoid this. But in any place where
the linker sees two differently optimized definitions for a function
and picks one as the definitive version all non-inlined calls link to,
we have this problem.

Assuming you need this new definition (but under ODR, the semantics
are correct), I would rather split the linkage than change it. E.g.,
use a name like available_externally_odr for (2) and (3).

If what I said above is correct (i.e. ODR == OD across everything
you're linking into your final executable) then splitting the linkage
/ adding a new one is probably the best alternative.

-- Sanjoy

If we do not inline @foo(), and instead re-link the call site in @main
to some non-optimized copy (or differently optimized copy) of @foo,
then it is possible for the program to have the behavior {print(“Y”);
print (“X”)}, which was disallowed in the earlier program.

In other words, opt refined the semantics of @foo() (i.e. reduced the
set of behaviors it may have) in ways that would make later
optimizations invalid if we de-refine the implementation of @foo().

I’m probably missing something obvious here. How could the result of
%t0 != %t1 be different at optimization time in one file than from
runtime in the “real” implementation? Doesn’t this make the CSE
invalid?

%t0 and %t1 are “allowed” to “always be the same”, i.e. an
implementation of @foo that always feeds in the same
value for %t0 and %t1 is a valid implementation (which is why the
CSE was valid); but it is not the only valid implementation. If I
don’t CSE the two load instructions (also a valid thing to do), and
this is a second thread writing to %par, then the two values loaded
can be different, and you could end up printing "X" in @foo.

Did that make sense?

Does linkonce_odr linkage have the same problem?

  • If so, do you want to change it too?
  • Else, why not?

Going by the specification in the LangRef, I’d say it depends on how
you define “definitive”. If you’re allowed to replace the body of a
function with a differently optimized body, then the above problem
exists.

I believe that is the case, and I strongly believe the problem you outline exists for linkonce_odr exactly as it does for available_externally.

Which is what makes this scary: every C++ inline function today can trigger this.

The above example is clearly fabricated, but such cases can come up
even if everything is optimized to the same level. E.g. one of the
atomic loads in the unrefined implementation of @foo() could have been
hidden behind a function call, whose body existed in only one module.
That module would then be able to refine @foo() to ret void but
other modules won’t.

The only solution I can think of is to redefine available_externally
to mean “the only kind of IPO/IPA you can do over a call to this
function is to inline it”. Redefining available_externally this way
will also let us soundly use it to represent calls to functions that
have guard intrinsics, since a failed guard intrinsic basically
replaces the function with a “very de-refined” implementation (the
interpreter).

What do you think? I don’t think implementing the above will be
very difficult, but needless to say, it will still be a fairly
non-trivial semantic change (hence I’m not directly jumping to
implementation).

This linkage is used in three places (that I know of) by clang:

  1. C-style inline functions.
  2. Functions defined in C++ template classes with external explicit
    instantiations, e.g. S::foo() in:

         template <class T> struct S { void foo() {} };
         void bar() { S<int>().foo(); }
         extern template struct S<int>;

  3. -flto=thin cross-module function importing.

(No comment on (1); its exact semantics are a little fuzzy to me.)
For (2) and (3), the current behaviour seems correct, and I’d be
hesitant to lose optimizing power. (2) is under the “ODR” rule, and
I think we’ve been applying the same logic to (3). Unless, are you
saying ODR isn’t enough?

By ODR, do you mean you only have one definition of the function in
the whole link (i.e. across all modules you’ll link together)?
Then yes, ODR should be enough to avoid this. But in any place where
the linker sees two differently optimized definitions for a function
and picks one as the definitive version all non-inlined calls link to,
we have this problem.

No, different levels of optimization must be allowed within ODR. So this is a problem within an ODR context.

(The term ODR applies to one source definition, not one optimized definition)

Typo'd a little bit here: should be "and there is a second thread
writing to `%ptr`"

-- Sanjoy

>> If we do not inline @foo(), and instead re-link the call site in @main
>> to some non-optimized copy (or differently optimized copy) of @foo,
>> then it is possible for the program to have the behavior {print("Y");
>> print ("X")}, which was disallowed in the earlier program.
>>
>> In other words, opt refined the semantics of @foo() (i.e. reduced the
>> set of behaviors it may have) in ways that would make later
>> optimizations invalid if we de-refine the implementation of @foo().
>
> I'm probably missing something obvious here. How could the result of
> `%t0 != %t1` be different at optimization time in one file than from
> runtime in the "real" implementation? Doesn't this make the CSE
> invalid?

`%t0` and `%t1` are "allowed" to "always be the same", i.e. an
implementation of @foo that always feeds in the same
value for `%t0` and `%t1` is a valid implementation (which is why the
CSE was valid); but it is not the *only* valid implementation. If I
don't CSE the two load instructions (also a valid thing to do), and
this is a second thread writing to `%par`, then the two values loaded
can be different, and you could end up printing `"X"` in `@foo`.

Did that make sense?

> Does linkonce_odr linkage have the same problem?
> - If so, do you want to change it too?
> - Else, why not?

Going by the specification in the LangRef, I'd say it depends on how
you define "definitive". If you're allowed to replace the body of a
function with a differently optimized body, then the above problem
exists.

I believe that is the case, and I strongly believe the problem you outline
exists for linkonce_odr exactly as it does for available_externally.

Which is what makes this scary: every C++ inline function today can trigger
this.

More generally, the same problem applies to all entities within
comdats (we can view available_externally and linkonce_odr as special
cases of single-symbol comdats). Similar problems show up with static
local variables in inline functions (where the comdats may be
equivalent across object files, but aren't necessarily identical -- in
particular, one can use constant initialization for a static local
variable where the other uses dynamic initialization, and inlining the
one with constant initialization then selecting the one with dynamic
initialization results in the variable not being initialized).
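
A sketch of that static-local case (illustrative names; assume one TU
can see the initializer's callee and another cannot):

  // widget.h, included by several TUs:
  int next_id();                   // defined in exactly one TU

  inline int widget_id() {
    static int id = next_id();     // TU A may prove next_id() is constant and
    return id;                     // emit `id` with constant initialization;
  }                                // TU B emits the guarded dynamic initializer

If a caller in TU A is compiled against the constant-initialized copy
but the linker keeps TU B's copy of the variable and its guard, that
caller can read the variable before it has been initialized.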

One approach that appears superficially correct would be to say that
we cannot move information across a comdat / available_externally /
linkonce_odr boundary unless doing so allows us to completely remove
that reference to the comdat / function / global. (In particular, you
can't use the attrs on a function in a comdat from outside that comdat
-- unless you somehow know they're present for all versions of that
comdat.) But that seems to have a pretty huge impact on optimizability
of comdats.

>> If we do not inline @foo(), and instead re-link the call site in @main
>> to some non-optimized copy (or differently optimized copy) of @foo,
>> then it is possible for the program to have the behavior {print("Y");
>> print ("X")}, which was disallowed in the earlier program.
>>
>> In other words, opt refined the semantics of @foo() (i.e. reduced the
>> set of behaviors it may have) in ways that would make later
>> optimizations invalid if we de-refine the implementation of @foo().
>
> I'm probably missing something obvious here. How could the result of
> `%t0 != %t1` be different at optimization time in one file than from
> runtime in the "real" implementation? Doesn't this make the CSE
> invalid?

`%t0` and `%t1` are "allowed" to "always be the same", i.e. an
implementation of @foo that always feeds in the same
value for `%t0` and `%t1` is a valid implementation (which is why the
CSE was valid); but it is not the *only* valid implementation. If I
don't CSE the two load instructions (also a valid thing to do), and
this is a second thread writing to `%par`, then the two values loaded
can be different, and you could end up printing `"X"` in `@foo`.

Did that make sense?

Yes. To be sure I understand the scope: this is only a problem for
atomics, correct? (Because multi-threaded behaviour with other globals
is UB?)

> Does linkonce_odr linkage have the same problem?
> - If so, do you want to change it too?
> - Else, why not?

Going by the specification in the LangRef, I'd say it depends on how
you define "definitive". If you're allowed to replace the body of a
function with a differently optimized body, then the above problem
exists.

I believe that is the case, and I strongly believe the problem you outline exists for linkonce_odr exactly as it does for available_externally.

Which is what makes this scary: every C++ inline function today can trigger this.

Every C/C++ inline or template function. But only the ones that use
atomics, right?

Not that I'm sure that will end up being a helpful distinction.

If we do not inline @foo(), and instead re-link the call site in @main
to some non-optimized copy (or differently optimized copy) of @foo,
then it is possible for the program to have the behavior {print(“Y”);
print (“X”)}, which was disallowed in the earlier program.

In other words, opt refined the semantics of @foo() (i.e. reduced the
set of behaviors it may have) in ways that would make later
optimizations invalid if we de-refine the implementation of @foo().

I’m probably missing something obvious here. How could the result of
%t0 != %t1 be different at optimization time in one file than from
runtime in the “real” implementation? Doesn’t this make the CSE
invalid?

%t0 and %t1 are “allowed” to “always be the same”, i.e. an
implementation of @foo that always feeds in the same
value for %t0 and %t1 is a valid implementation (which is why the
CSE was valid); but it is not the only valid implementation. If I
don’t CSE the two load instructions (also a valid thing to do), and
this is a second thread writing to %par, then the two values loaded
can be different, and you could end up printing "X" in @foo.

Did that make sense?

Yes. To be sure I understand the scope: this is only a problem for
atomics, correct? (Because multi-threaded behaviour with other globals
is UB?)

Does linkonce_odr linkage have the same problem?

  • If so, do you want to change it too?
  • Else, why not?

Going by the specification in the LangRef, I’d say it depends on how
you define “definitive”. If you’re allowed to replace the body of a
function with a differently optimized body, then the above problem
exists.

I believe that is the case, and I strongly believe the problem you outline exists for linkonce_odr exactly as it does for available_externally.

Which is what makes this scary: every C++ inline function today can trigger this.

Every C/C++ inline or template function. But only the ones that use
atomics, right?

Well, with this example…

Not that I’m sure that will end up being a helpful distinction.

Right. See Richard’s comment. I think that sums up the real issue here. =/

>
> >> If we do not inline @foo(), and instead re-link the call site in
> >> @main
> >> to some non-optimized copy (or differently optimized copy) of @foo,
> >> then it is possible for the program to have the behavior {print("Y");
> >> print ("X")}, which was disallowed in the earlier program.
> >>
> >> In other words, opt refined the semantics of @foo() (i.e. reduced the
> >> set of behaviors it may have) in ways that would make later
> >> optimizations invalid if we de-refine the implementation of @foo().
> >
> > I'm probably missing something obvious here. How could the result of
> > `%t0 != %t1` be different at optimization time in one file than from
> > runtime in the "real" implementation? Doesn't this make the CSE
> > invalid?
>
> `%t0` and `%t1` are "allowed" to "always be the same", i.e. an
> implementation of @foo that always feeds in the same
> value for `%t0` and `%t1` is a valid implementation (which is why the
> CSE was valid); but it is not the *only* valid implementation. If I
> don't CSE the two load instructions (also a valid thing to do), and
> this is a second thread writing to `%par`, then the two values loaded
> can be different, and you could end up printing `"X"` in `@foo`.
>
> Did that make sense?

Yes. To be sure I understand the scope: this is only a problem for
atomics, correct? (Because multi-threaded behaviour with other globals
is UB?)

> > Does linkonce_odr linkage have the same problem?
> > - If so, do you want to change it too?
> > - Else, why not?
>
> Going by the specification in the LangRef, I'd say it depends on how
> you define "definitive". If you're allowed to replace the body of a
> function with a differently optimized body, then the above problem
> exists.
>
> I believe that is the case, and I strongly believe the problem you
> outline exists for linkonce_odr exactly as it does for available_externally.
>
> Which is what makes this scary: every C++ inline function today can
> trigger this.

Every C/C++ inline or template function. But only the ones that use
atomics, right?

Well, with *this* example...

Atomics are one source of non-determinism that compilers can reason
about. I don't know if the following snippet is well defined or not,
but you could have similar issues with

  void foo() {
    int *p = malloc(sizeof(int));
    if (*p < 10) print("X");
  }

or (again, I don't know if this is actually well defined)

  void foo() {
    int t; // it is probably reasonable to fold compares with
           // ptrtoint(alloca) to undef
    if ((intptr_t)(&t) < 10) print("X");
  }

-- Sanjoy

If we do not inline @foo(), and instead re-link the call site in @main
to some non-optimized copy (or differently optimized copy) of @foo,
then it is possible for the program to have the behavior {print("Y");
print ("X")}, which was disallowed in the earlier program.

In other words, opt refined the semantics of @foo() (i.e. reduced the
set of behaviors it may have) in ways that would make later
optimizations invalid if we de-refine the implementation of @foo().

I'm probably missing something obvious here. How could the result of
`%t0 != %t1` be different at optimization time in one file than from
runtime in the "real" implementation? Doesn't this make the CSE
invalid?

`%t0` and `%t1` are "allowed" to "always be the same", i.e. an
implementation of @foo that always feeds in the same
value for `%t0` and `%t1` is a valid implementation (which is why the
CSE was valid); but it is not the *only* valid implementation. If I
don't CSE the two load instructions (also a valid thing to do), and
this is a second thread writing to `%par`, then the two values loaded
can be different, and you could end up printing `"X"` in `@foo`.

Did that make sense?

Yes. To be sure I understand the scope: this is only a problem for
atomics, correct? (Because multi-threaded behaviour with other globals
is UB?)

Does linkonce_odr linkage have the same problem?
- If so, do you want to change it too?
- Else, why not?

Going by the specification in the LangRef, I'd say it depends on how
you define "definitive". If you're allowed to replace the body of a
function with a differently optimized body, then the above problem
exists.

I believe that is the case, and I strongly believe the problem you
outline exists for linkonce_odr exactly as it does for available_externally.

Which is what makes this scary: every C++ inline function today can
trigger this.

Every C/C++ inline or template function. But only the ones that use
atomics, right?

Well, with *this* example...

Atomics are one source of non-determinism that compilers can reason
about. I don't know if the following snippet is well defined or not,
but you could have similar issues with

void foo() {
   int *p = malloc(sizeof(int));
   if (*p < 10) print("X");
}

or (again, I don't know if this is actually well defined)

void foo() {
   int t; // it is probably reasonable to fold compares with
          // ptrtoint(alloca) to undef
   if ((intptr_t)(&t) < 10) print("X");
}

The first one at least is UB, but as Richard pointed out the scope
is certainly broader than atomics (it's not even just well-defined
non-determinism).

I'm kind of terrified by the implications.

If we do not inline @foo(), and instead re-link the call site in @main
to some non-optimized copy (or differently optimized copy) of @foo,
then it is possible for the program to have the behavior {print("Y");
print ("X")}, which was disallowed in the earlier program.

In other words, opt refined the semantics of @foo() (i.e. reduced the
set of behaviors it may have) in ways that would make later
optimizations invalid if we de-refine the implementation of @foo().

I'm probably missing something obvious here. How could the result of
`%t0 != %t1` be different at optimization time in one file than from
runtime in the "real" implementation? Doesn't this make the CSE
invalid?

`%t0` and `%t1` are "allowed" to "always be the same", i.e. an
implementation of @foo that always feeds in the same
value for `%t0` and `%t1` is a valid implementation (which is why the
CSE was valid); but it is not the *only* valid implementation. If I
don't CSE the two load instructions (also a valid thing to do), and
this is a second thread writing to `%par`, then the two values loaded
can be different, and you could end up printing `"X"` in `@foo`.

Did that make sense?

Yes. To be sure I understand the scope: this is only a problem for
atomics, correct? (Because multi-threaded behaviour with other globals
is UB?)

Does linkonce_odr linkage have the same problem?
- If so, do you want to change it too?
- Else, why not?

Going by the specification in the LangRef, I'd say it depends on how
you define "definitive". If you're allowed to replace the body of a
function with a differently optimized body, then the above problem
exists.

I believe that is the case, and I strongly believe the problem you
outline exists for linkonce_odr exactly as it does for available_externally.

Which is what makes this scary: every C++ inline function today can
trigger this.

Every C/C++ inline or template function. But only the ones that use
atomics, right?

Well, with *this* example...

Atomics are one source of non-determinism that compilers can reason
about. I don't know if the following snippet is well defined or not,
but you could have similar issues with

  void foo() {
    int *p = malloc(sizeof(int));
    if (*p < 10) print("X");
  }

or (again, I don't know if this is actually well defined)

  void foo() {
    int t; // it is probably reasonable to fold compares with
           // ptrtoint(alloca) to undef
    if ((intptr_t)(&t) < 10) print("X");
  }

The first one at least is UB, but as Richard pointed out the scope
is certainly broader than atomics (it's not even just well-defined
non-determinism).

I'm kind of terrified by the implications.

Me too. :(

Yea, I’m pretty sad about all of this. I’m also not seeing a lot of awesome paths forward.

Here is the least bad strategy I can come up with. Curious if folks think this is sufficient:

  1. Stop deducing function attributes within comdats by examining the bodies of the functions (so that we remain free to transform the bodies of functions).
  2. Teach frontends to emit (even at O0!!!) trivially deduced function attributes for comdats so that we continue to catch easy cases.
  3. Ensure and specify that we never hoist code into a comdat group in which it would not have been executed previously. I don’t know of anything in LLVM that does this today, but it would become an important invariant.
  4. Work a lot harder to do internalizing and removing of this restriction.

Pretty horrible. But I think it is correct.

As a slight modification to #1 and #2, we could have a very carefully crafted deduction rule where we only deduce function attributes for functions prior to any modification of their function bodies. Such attributes should be conservatively correct because we would never lift new code into the function bodies. This would at least allow us to do bottom-up deduction to catch interprocedural cases. But it would become incredibly subtle that this is only valid prior to any transformations of the comdat-containing functions.

I’m starting to think this subtle rule might be worth it. But I’m frankly terrified by the implications.

A clarification pointed out by David Majnemer: what I’m really talking about is “comdat or comdat-semantic-equivalents”, which include linkonce_odr and available_externally.

As a further strategy to recover optimizations:

On platforms with comdat support, we could also teach function attrs that deduction is safe for functions which are only called from within their own comdat. And then we could do some work in Clang (or other FEs) to try to form larger comdats when safe to do so. For example with C++, inline method definitions inside of class bodies could have a merged comdat I think because the class body can’t change between translation units. This would in turn allow, for example, continuing to get full deduction for private inline methods inside of class bodies whose callers were all also inline and inside the class body.
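
For instance (purely illustrative):

  // Both methods are defined in the class body, and helper() is private
  // and only called from run(); the idea would be to put both in one
  // merged comdat so that attribute deduction for helper() stays safe.
  struct Widget {
    int run() { return helper() + 1; }
  private:
    int helper() { return 41; }
  };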

We could probably also do some tricks where we actually expand comdats themselves in order to enable deduction and subsequent optimizations, although that might need a cost model.

-Chandler

From: "Chandler Carruth via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Philip Reames" <listmail@philipreames.com>, "Duncan P. N. Exon
Smith" <dexonsmith@apple.com>, "Sanjoy Das"
<sanjoy@playingwithpointers.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Wednesday, February 24, 2016 10:29:23 PM
Subject: Re: [llvm-dev] Possible soundness issue with
available_externally (split from "RFC: Add guard intrinsics")

Yea, I'm pretty sad about all of this. I'm also not seeing a lot of
awesome paths forward.

Here is the least bad strategy I can come up with. Curious if folks
think this is sufficient:

I may not completely understand the problem, but this seems like overkill. The underlying restriction is that, if the compiler makes a non-determinism-collapsing choice when optimizing a function, it must make the same choice for all definitions of that function (undefined behavior excluded). Thus, with an available_externally function, the CSE in Sanjoy's original example should be forbidden. Richard's example again demonstrates this principle, although in this case the non-determinism is in the choice of a globally-visible implementation technique rather than non-determinism from memory-subsystem reordering.

There is a complication, which you imply in your proposal, that such optimizations need to be forbidden not just in the available_externally functions themselves, but in any local function transitively called by one. This, however, we can take care of with an (easily-deduced) attribute.

In short, it is not clear to me that the number of problematic optimizations is large (seems likely restricted to things involving atomics in practice), and while I understand the auditing difficulties here, we should just restrict these in appropriate contexts instead of trying to restrict all information flow into or out of comdats.

-Hal

From: “Chandler Carruth via llvm-dev” <llvm-dev@lists.llvm.org>
To: “Philip Reames” <listmail@philipreames.com>, “Duncan P. N. Exon
Smith” <dexonsmith@apple.com>, “Sanjoy Das”
<sanjoy@playingwithpointers.com>
Cc: “llvm-dev” <llvm-dev@lists.llvm.org>
Sent: Wednesday, February 24, 2016 10:29:23 PM
Subject: Re: [llvm-dev] Possible soundness issue with
available_externally (split from “RFC: Add guard intrinsics”)

Yea, I’m pretty sad about all of this. I’m also not seeing a lot of
awesome paths forward.

Here is the least bad strategy I can come up with. Curious if folks
think this is sufficient:

I may not completely understand the problem, but this seems like overkill. The underlying restriction is that, if the compiler makes a non-determinism-collapsing choice when optimizing a function, it must make the same choice for all definitions of that function (undefined behavior excluded).

This isn’t enough, because some definition in some other module may not be optimized at all, and yet may get selected at link time.

Put another way, it must prove that the same choice will always be made for all definitions. This is akin to proving that the optimizer is run over all translation units for C++ linkonce_odr functions, which you can’t do.

The result would be failing to optimize the bodies of linkonce_odr functions in any way which was externally detectable such as this. I think that would be much worse than losing the ability to do function attribute deduction for such functions?

From: "Chandler Carruth" <chandlerc@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Philip Reames"
<listmail@philipreames.com>, "Duncan P. N. Exon Smith"
<dexonsmith@apple.com>, "Sanjoy Das"
<sanjoy@playingwithpointers.com>
Sent: Wednesday, February 24, 2016 11:41:59 PM
Subject: Re: [llvm-dev] Possible soundness issue with
available_externally (split from "RFC: Add guard intrinsics")

> > From: "Chandler Carruth via llvm-dev" <llvm-dev@lists.llvm.org>
> > To: "Philip Reames" <listmail@philipreames.com>, "Duncan P. N. Exon
> > Smith" <dexonsmith@apple.com>, "Sanjoy Das"
> > <sanjoy@playingwithpointers.com>
> > Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
> > Sent: Wednesday, February 24, 2016 10:29:23 PM
> > Subject: Re: [llvm-dev] Possible soundness issue with
> > available_externally (split from "RFC: Add guard intrinsics")
> >
> > Yea, I'm pretty sad about all of this. I'm also not seeing a lot of
> > awesome paths forward.
> >
> > Here is the least bad strategy I can come up with. Curious if folks
> > think this is sufficient:
>
> I may not completely understand the problem, but this seems like
> overkill. The underlying restriction is that, if the compiler makes
> a non-determinism-collapsing choice when optimizing a function, it
> must make the same choice for all definitions of that function
> (undefined behavior excluded).

This isn't enough, because some definition in some other module may
*not be optimized at all*, and yet may get selected at link time.

Put another way, it must *prove* that the same choice will *always*
be made for all definitions. This is akin to proving that the
optimizer is run over all translation units for C++ linkonce_odr
functions, which you can't do.

Sure; which is why I said that we should not perform those optimizations (instead of saying that we just need to make sure that the same choice will be made everywhere - as you say, LTO aside, we can't do that).

The result would be failing to optimize the bodies of linkonce_odr
functions in any way which was externally detectable such as this. I
think that would be *much* worse than losing the ability to do
function attribute deduction for such functions?

But it is not all optimizations that are the problem. Rather, it seems like a select few (e.g. things involving collapsing allowed non-determinism in atomics), and losing those optimizations seems better than generally losing function-attribute deduction.

-Hal

From: “Chandler Carruth via llvm-dev” <llvm-dev@lists.llvm.org>
To: “Philip Reames” <listmail@philipreames.com>, “Duncan P. N. Exon
Smith” <dexonsmith@apple.com>, “Sanjoy Das”
<sanjoy@playingwithpointers.com>
Cc: “llvm-dev” <llvm-dev@lists.llvm.org>
Sent: Wednesday, February 24, 2016 10:29:23 PM
Subject: Re: [llvm-dev] Possible soundness issue with
available_externally (split from “RFC: Add guard intrinsics”)

Yea, I’m pretty sad about all of this. I’m also not seeing a lot of
awesome paths forward.

Here is the least bad strategy I can come up with. Curious if folks
think this is sufficient:

I may not completely understand the problem, but this seems like overkill. The underlying restriction is that, if the compiler makes a non-determinism-collapsing choice when optimizing a function, it must make the same choice for all definitions of that function (undefined behavior excluded).

This isn’t enough, because some definition in some other module may not be optimized at all, and yet may get selected at link time.

Put another way, it must prove that the same choice will always be made for all definitions. This is akin to proving that the optimizer is run over all translation units for C++ linkonce_odr functions, which you can’t do.

Even if the optimizer is run, it could make a different decision because the context would be different:

linkonce_odr foo() {
  bar();
}

If bar() is present in the TU it can get inlined into foo(), so the optimizer would optimize foo() differently.

The result would be failing to optimize the bodies of linkonce_odr functions in any way which was externally detectable such as this. I think that would be much worse than losing the ability to do function attribute deduction for such functions?

Thus, with an available_externally function, the CSE in Sanjoy’s original example should be forbidden. Richard’s example again demonstrates this principle, although in this case the non-determinism is in the choice of a globally-visible implementation technique rather than non-determinism from memory-subsystem reordering.

There is a complication, which you imply in your proposal, that such optimizations need to be forbidden not just in the available_externally functions themselves, but in any local function transitively called by one. This, however, we can take care of with an (easily-deduced) attribute.

I’m not sure why "such optimizations need to be forbidden […] in any local function transitively called by one", can you illustrate with an example?

From: “Chandler Carruth via llvm-dev” <llvm-dev@lists.llvm.org>
To: “Philip Reames” <listmail@philipreames.com>, “Duncan P. N. Exon
Smith” <dexonsmith@apple.com>, “Sanjoy Das”
<sanjoy@playingwithpointers.com>
Cc: “llvm-dev” <llvm-dev@lists.llvm.org>
Sent: Wednesday, February 24, 2016 10:29:23 PM
Subject: Re: [llvm-dev] Possible soundness issue with
available_externally (split from “RFC: Add guard intrinsics”)

Yea, I’m pretty sad about all of this. I’m also not seeing a lot of
awesome paths forward.

Here is the least bad strategy I can come up with. Curious if folks
think this is sufficient:

I may not completely understand the problem, but this seems like overkill. The underlying restriction is that, if the compiler makes a non-determinism-collapsing choice when optimizing a function, it must make the same choice for all definitions of that function (undefined behavior excluded).

This isn’t enough, because some definition in some other module may not be optimized at all, and yet may get selected at link time.

Put another way, it must prove that the same choice will always be made for all definitions. This is akin to proving that the optimizer is run over all translation units for C++ linkonce_odr functions, which you can’t do.

The result would be failing to optimize the bodies of linkonce_odr functions in any way which was externally detectable such as this. I think that would be much worse than losing the ability to do function attribute deduction for such functions?

If you deduce attributes prior to optimising the function in any way (which I think is a reasonable thing to require, as you said before), then any attributes added there will at least be conservatively correct. That will still allow a bunch of optimisation, I hope!

So for the following function:

int foo(int *i) {
  if (*i == 1)
    return 0;
  return 1;
}

You can legitimately add readonly and propagate to callers that every bit of the returned value above bit 0 is known to be zero (the result is always 0 or 1).

And for this similar function:

int foo(int *i, bool b) {
  if (b) {
    *i = 1;
    return 0;
  }
  return 1;
}

You could legitimately optimise a call ‘foo(p, false)’ so that the call site propagates in the constants and knows that this call returns 1 for sure.

But could you constant propagate (without inlining) into the call and add readonly to the call instruction here? I think you can, because you know that passing false for b means the function will not take the path that writes memory.

Pete

Chandler Carruth wrote:

A clarification pointed out by David Majnemer: what I'm really talking about is "comdat or comdat-semantic-equivalents"
which include linkonce_odr and available_externally.

As a further strategy to recover optimizations:

On platforms with comdat support, we could also teach function attrs that deduction is safe for functions which are only
called from within their own comdat. And then we could do some work in Clang (or other FEs) to try to form larger
comdats when safe to do so. For example with C++, inline method definitions inside of class bodies could have a merged
comdat I think because the class body can't change between translation units. This would in turn allow, for example,
continuing to get full deduction for private inline methods inside of class bodies whose callers were all also inline
and inside the class body.

The tricky bit here is that we have to ensure that there is no
transitive path (i.e. not just direct calls) from outside the comdat
to the functions in the comdat we want to run -functionattrs on.
Otherwise we'll run into issues where a comdat contains {A,B}, and
only B calls A, but B is called from C, which is outside the comdat.
In such a case you don't want to run -functionattrs on A and then
inline B into C.

-- Sanjoy

From: "Mehdi Amini" <mehdi.amini@apple.com>
To: "Chandler Carruth" <chandlerc@google.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Thursday, February 25, 2016 12:02:16 AM
Subject: Re: [llvm-dev] Possible soundness issue with
available_externally (split from "RFC: Add guard intrinsics")

> > > From: "Chandler Carruth via llvm-dev" <llvm-dev@lists.llvm.org>
> > > To: "Philip Reames" <listmail@philipreames.com>, "Duncan P. N. Exon
> > > Smith" <dexonsmith@apple.com>, "Sanjoy Das"
> > > <sanjoy@playingwithpointers.com>
> > > Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
> > > Sent: Wednesday, February 24, 2016 10:29:23 PM
> > > Subject: Re: [llvm-dev] Possible soundness issue with
> > > available_externally (split from "RFC: Add guard intrinsics")
> > >
> > > Yea, I'm pretty sad about all of this. I'm also not seeing a lot of
> > > awesome paths forward.
> > >
> > > Here is the least bad strategy I can come up with. Curious if folks
> > > think this is sufficient:
> >
> > I may not completely understand the problem, but this seems like
> > overkill. The underlying restriction is that, if the compiler makes
> > a non-determinism-collapsing choice when optimizing a function, it
> > must make the same choice for all definitions of that function
> > (undefined behavior excluded).
>
> This isn't enough, because some definition in some other module may
> *not be optimized at all*, and yet may get selected at link time.
>
> Put another way, it must *prove* that the same choice will *always*
> be made for all definitions. This is akin to proving that the
> optimizer is run over all translation units for C++ linkonce_odr
> functions, which you can't do.

Even if the optimizer is run, it could make a different decision
because the context would be different:

  linkonce_odr foo() {
    bar();
  }

If bar() is present in the TU it can get inlined into foo(), so the
optimizer would optimize foo() differently.

> The result would be failing to optimize the bodies of linkonce_odr
> functions in any way which was externally detectable such as this. I
> think that would be *much* worse than losing the ability to do
> function attribute deduction for such functions?
>
> > Thus, with an available_externally function, the CSE in Sanjoy's
> > original example should be forbidden. Richard's example again
> > demonstrates this principle, although in this case the
> > non-determinism is in the choice of a globally-visible
> > implementation technique rather than non-determinism from
> > memory-subsystem reordering.
> >
> > There is a complication, which you imply in your proposal, that
> > such optimizations need to be forbidden not just in the
> > available_externally functions themselves, but in any local
> > function transitively called by one. This, however, we can take
> > care of with an (easily-deduced) attribute.

I'm not sure why "such optimizations need to be forbidden [...] in
any local function transitively called by one", can you illustrate
with an example?

Take Sanjoy's example with the two atomic loads, but outline the body of the function into some private function. The same restrictions need to apply.
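
(Something like the following, with illustrative names:)

  #include <atomic>
  #include <cstdio>

  // Local function holding the outlined body of @foo.
  static void check(std::atomic<int> *p) {
    int t0 = p->load(std::memory_order_relaxed);
    int t1 = p->load(std::memory_order_relaxed);
    if (t0 != t1) std::printf("X");
  }

  // The available_externally-style wrapper; deducing attributes for
  // check() and then using them at foo()'s call sites runs into the
  // same issue.
  inline void foo(std::atomic<int> *p) { check(p); }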

-Hal