Clang devirtualization proposal

Hi folks,
This summer I will be working with Richard Smith on Clang devirtualization. Check out our proposal:

https://docs.google.com/document/d/1f2SGa4TIPuBGm6y6YO768GrQsA8awNfGEJSBFukLhYA/edit?usp=sharing

And the modified LangRef:
http://reviews.llvm.org/D11399

You can also check out the previous discussion, which started before our proposal was ready: http://lists.cs.uiuc.edu/pipermail/cfe-dev/2015-July/044052.html

Regards
Piotr Padlewski

Hi Piotr,

You may be interested in a recent patch I posted: http://reviews.llvm.org/D11043

This patch addresses a de-virtualization case that I’m not sure would be handled by your current proposal, namely that of a virtual call where the ‘this’ object is a global variable.

For example:

struct A {
  A();
  virtual void foo();
};

void g(A *a) {
  a->foo();
}

A a;

int main() {
  g(&a);
}

It might be worth considering whether your approach could be extended to handle this case.
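(For concreteness, a rough C++-level picture of what devirtualizing this call would buy, assuming g is inlined into main — only an illustration of the intended effect, not what the patch literally does:)

  int main() {
    // 'a' is a global whose dynamic type is known to be exactly A, so the
    // virtual call can become the equivalent of a qualified, non-virtual call:
    a.A::foo(); // no vtable load, and A::foo can now be inlined
  }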

Hi,
Yep, our proposal doesn't cover it, because the load; icmp; assume will land in the global initializer function, and main will not see it.
At least if foo were called multiple times we would have only one load from the vtable, but unfortunately we would not be able to inline it or make a direct call to it with this approach.
I think this case is rare enough that we don't need to solve it right now.

Piotr

int main() {
  g(&a);

We could in principle emit the load/icmp/assume for a's vptr here, but I'm
concerned that this is not actually sound in general. Consider:

struct A;
extern A a;
void *p = &a; // emit load/icmp/assume here?
struct A { virtual void f(); };
A a;

If we emit an assume on the vptr at the point where the address of a is
taken, we get:

@a = global %A zeroinitializer
void @global_init() {
  load @a
  %x = icmp @a, @A.vtable
  call @llvm.assume(%x)
  store @a, @p
  call @A::A(@a)
}

... where the assume call can be folded to "call @llvm.assume(i1 false)",
which (presumably) can be reduced further to "unreachable" (oops). (In
Geoff's example, this is sound only because we're guaranteed that 'a' is
initialized before we enter 'main', but an optimization that only fires for
code within 'main' doesn't sound very useful.)

We could support this with something weaker than the current
load/icmp/assume pattern: we could imagine an instruction/intrinsic that
says "any later load of pointer %p with !invariant.group !g will load the
value %q", but that does not imply that the value %q is currently stored in
%p. It's not clear to me whether that has sufficient value to be worthwhile.

}

From: "Richard Smith" <richard@metafoo.co.uk>
To: "Piotr Padlewski" <prazek@google.com>
Cc: "llvmdev@cs.uiuc.edu Mailing List" <llvmdev@cs.uiuc.edu>, "cfe-dev@cs.uiuc.edu Developers" <cfe-dev@cs.uiuc.edu>
Sent: Thursday, July 23, 2015 4:21:42 PM
Subject: Re: [cfe-dev] [LLVMdev] Clang devirtualization proposal

... where the assume call can be folded to "call @llvm.assume(i1 false)", which (presumably) can be reduced further to "unreachable"

To confirm, yes, @llvm.assume(false) -> unreachable by llvm::removeUnreachableBlocks.

-Hal

Hi Piotr,

Thanks for posting this! First, a question. When you say, regarding i8* @llvm.invariant.group.barrier(i8*):

"Required to handle destructors, placement new and std::launder. Call of this function will be put on the end of each of this functions"

I completely understand placement new and std::launder. I don't understand destructors, could you explain?

Also, am I correct in saying that we could handle the case of 'final' classes I highlighted in my initial e-mail by inserting these assumptions whenever a pointer/reference to a class of such a type comes into scope?

struct A {
  virtual void foo() = 0;
};

struct B final : public A {
  void foo();
};

void entry(B *b) {
// emit assumptions about vtbl of 'b' here?
}

Thanks again,
Hal

Hi Piotr,

Thanks for posting this! First, a question. When you say, regarding i8*
@llvm.invariant.group.barrier(i8*):

"Required to handle destructors, placement new and std::launder. Call of
this function will be put on the end of each of this functions"

I completely understand placement new and std::launder. I don't understand
destructors, could you explain?

When a derived class destructor invokes a base class destructor, the
dynamic type of the object changes (as does the vptr), so we need an
invariant barrier to prevent the derived class's vptr being used for
virtual calls in an inlined base class destructor.
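(A minimal C++ illustration of the destructor case described above; the comments restate standard C++ semantics rather than anything specific to the proposal:)

  struct Base {
    virtual ~Base();
    virtual void notify();
  };
  struct Derived : Base {
    void notify() override;
    ~Derived() {} // after this body runs, Base::~Base executes with the dynamic
                  // type (and vptr) already changed from Derived to Base
  };
  Base::~Base() {
    notify(); // must dispatch to Base::notify; the Derived part no longer exists,
  }           // so an inlined base destructor must not reuse Derived's vptr here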

Also, am I correct in saying that we could handle the case of 'final'
classes I highlighted in initial e-mail by inserting these assumptions
whenever a pointer/reference to a class of such a type came into scope?

Yes, it would be correct to insert these assumptions anywhere where the
language standard guarantees that there exists an object of a known
(most-derived) dynamic type (and in particular, we can do this whenever we
know there exists an object of a known final type).
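(Two source-level examples of such places, reusing the B above — an illustration only, not an exhaustive list:)

  B local;
  local.foo(); // 'local' is a complete object, so its dynamic type is exactly B

  B *p = new B;
  p->foo();    // likewise, the new-expression creates an object of dynamic type B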

CodeGen already invokes EmitTypeCheck in many of the places where it's
guaranteed that an object of a known type exists; we could experiment with
adding an assumption from it any time that type is final.

struct A {
  virtual void foo() = 0;
};

struct B final : public A {
  void foo();
};

void entry(B *b) {
// emit assumptions about vtbl of 'b' here?

This case is tricky. We don't currently have a way of saying "assume that a
load of %b would load %B.vtbl" without also saying "assume that %b is
dereferenceable". We've seen other cases where that would be beneficial, so
perhaps that's something we should consider adding.
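(Concretely, the worry is that entry may legitimately be called with a pointer that is never dereferenced, so an unconditional vptr assumption at function entry would assert more than the language guarantees — a small sketch:)

  void entry(B *b) {
    if (!b)
      return; // 'b' may be null here, so unconditionally emitting
              // "load b's vptr and assume it equals B's vtable" on
              // function entry would not be sound
    b->foo();
  }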

From: "Richard Smith" <richard@metafoo.co.uk>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Piotr Padlewski" <prazek@google.com>, "cfe-dev@cs.uiuc.edu Developers" <cfe-dev@cs.uiuc.edu>,
"llvmdev@cs.uiuc.edu Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Saturday, July 25, 2015 7:43:37 PM
Subject: Re: [cfe-dev] Clang devirtualization proposal

Hi Piotr,

Thanks for posting this! First, a question. When you say, regarding
i8* @llvm.invariant.group.barrier(i8*):

"Required to handle destructors, placement new and std::launder. Call
of this function will be put on the end of each of this functions"

I completely understand placement new and std::launder. I don't
understand destructors, could you explain?

When a derived class destructor invokes a base class destructor, the
dynamic type of the object changes (as does the vptr), so we need an
invariant barrier to prevent the derived class's vptr being used for
virtual calls in an inlined base class destructor.

Interesting. So we'll launder the 'this' pointer through @llvm.invariant.group.barrier before changing the vptr and calling the base class destructor?

Also, am I correct in saying that we could handle the case of 'final'
classes I highlighted in initial e-mail by inserting these
assumptions whenever a pointer/reference to a class of such a type
came into scope?

Yes, it would be correct to insert these assumptions anywhere where
the language standard guarantees that there exists an object of a
known (most-derived) dynamic type (and in particular, we can do this
whenever we know there exists an object of a known final type).

CodeGen already invokes EmitTypeCheck in many of the places where
it's guaranteed that an object of a known type exists; we could
experiment with adding an assumption from it any time that type is
final.

Sounds good.

struct A {
  virtual void foo() = 0;
};

struct B final : public A {
  void foo();
};

void entry(B *b) {
  // emit assumptions about vtbl of 'b' here?

This case is tricky. We don't currently have a way of saying "assume
that a load of %b would load %B.vtbl" without also saying "assume
that %b is dereferenceable". We've seen other cases where that would
be beneficial, so perhaps that's something we should consider
adding.

Ah, good point. We could only do it for references currently. I think adding something in this direction makes sense, something that says, perhaps, "we don't know if %b is dereferenceable, but if it is, and you were to load from it, you'd get the following."
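(For instance, the reference form works with the existing pattern, since a reference must bind to a valid object — a minimal sketch:)

  void entry(B &b) {
    // 'b' is known to refer to a live B, so the load/icmp/assume of its vptr
    // can be emitted here without asserting any extra dereferenceability
    b.foo();
  }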

Thanks again,
Hal

Having read through the proposal, I feel like I'm missing some of the background needed to understand the problem you're trying to solve.

My mental model is that construction of an object creates a new abstract location in an infinite heap with each object infinitely far apart. Destruction of the object destroys the abstract location. As a result, destructing one object and constructing another produce unique incomparable abstract locations. The fact the two abstract locations might happen to share a physical address is irrelevant.

If I’m understanding the proposal correctly, this model works for most code. The key optimization you appear to want to perform is to recognize the fact that these two abstract locations occupy the same memory. In particular, you want to be able to return mustalias for alias(loc1, loc2). Another way of saying this is that you want to reason about abstract locations as defined by allocation/deallocation events rather than construction/destruction events. Is that a fair summary?

What I'm not clear on is why recognizing that the two abstract locations share a physical address is important. Given that the contents of the abstract location before construction or after destruction are undefined (right?), what optimization does recognizing the mustalias relation enable?

Philip

I think this is incorrect. LLVM's model is closer to the second model, and
we need something like the first model to prevent erroneous
devirtualization.

The corner case for C++ is when the optimizer observes that two abstract
objects share the same physical memory location. In practice, this could
happen if the memory allocator becomes visible to the optimizer through
inlining. For illustration, do placement new into the stack memory of
another object. This is illustrated in example 2 of the proposal:

struct MyClass {
  virtual void foo();
};
struct MyOtherClass : MyClass {
  virtual void foo();
};
int main() {
  MyClass c;
  c.foo();
  // Reuse the storage temporarily. UB to access the object through ‘c’
  c.~MyClass();
  auto c2 = new (&c) MyOtherClass();
  c2->foo(); // fine, we have a new pointer
  // c.foo() // UB, the type has changed

  // The storage has to contain a ‘MyClass’ when it goes out of scope.
  c2->~MyOtherClass();
  new (&c) MyClass(); // we have to get back to the previous type, because
                      // calling the destructor through ‘c’ would otherwise be UB
}

Without @llvm.invariant.group.barrier, LLVM will probably replace %c2 with
%c here, since they are trivially the same.

With @llvm.invariant.group.barrier, the result of placement new will be a
distinct SSA value that LLVM can't reason about, and we won't accidentally
devirtualize c2->foo() to MyClass::foo.

There is, however, a small problem with this model. If the code happened to
do this:

  ...
  auto c2 = new (&c) MyOtherClass();
  assert(c2 == &c);
  ...

LLVM might once again replace %c2 with %c, causing bad devirtualization.

So, to phrase this differently, @llvm.invariant.group.barrier is responsible for introducing a new abstract memory location, and the optimizer is agreeing never to exploit the fact that the new abstract memory location is in fact at the same address? Where in the execution would this new abstract location be introduced? Is it immediately before the placement new? If so, that would seem consistent.

Is this well-defined C++? My reading would be that it is. If this were realloc, it clearly wouldn't be, but I'm not sure placement new has the same restrictions.

Assuming that it is well-defined C++, this was exactly the counterexample I was missing and the reason that reasoning about abstract locations per object doesn't work. Thanks. Given that, I can see why we're stuck with a single abstract location for the storage and need to add and remove the invariantness of a particular location. I'll go take another read through the proposal with that in mind.

Philip

Sorry, ignore the above. I got to the part below and forgot to update this before sending. I hadn’t yet internalized the bit about not being able to reason about disjoint abstract locations when I wrote this.

Ok, replying anew now that I understand why reasoning about abstract locations for each object doesn’t work.

The general idea of describing a set of loads and stores which belong to a particular invariant group seems reasonable. I've got some questions/comments on the specifics, but the overall direction seems entirely workable for the specific problem you're trying to solve.

Quoting from the google doc: “If we don’t know definition of some function, we assume that it will not call @llvm.invariant.group.barrier().”
This part really, really bugs me. We generally try to assume minimal knowledge of external functions (i.e. they can do anything), and this assumption would invert that. Is there a way we can rephrase the proposal that avoids the need for this? I'm not quite clear on what this assumption buys us.

Is there a particular reason why a load or store must belong to a single invariant group rather than be a member of several? I don’t have an immediate use case in mind, but it seems potentially useful.

“i8* @llvm.invariant.group.barrier(i8*): Given a pointer, produces another pointer that aliases the first but which is considered different for the purposes of load !invariant.group metadata.”
The definition here minorly bugs me. This might just be a matter of wordsmithing, but it seems strange to me that this can't be defined in terms of the assumptions allowed about the memory reached through the two pointers. If I'm looking at this instruction in memory dependence analysis, what am I allowed to assume, and what am I not? The current definition doesn't make this obvious. One option: "produces another pointer which aliases the first, but where any invariant assumptions introduced by invariant.group metadata have been stripped away."

The notion of using the assume seems to make sense. I could see an argument for extending the invariant.group metadata with a way to express the assumed value, but we could also make that extension at a later time if needed.

I'm wondering if there's a problematic interaction with CSE here. Consider this example in pseudo LLVM IR:
v1 = load i64, %p1, !invariant.group !Type1
; I called destructor/placement new for the same type, but that was optimized away entirely
p2 = invariant.group.barrier(p1)
if (p1 != p2) return.
store i64 0, %p2, !invariant.group !Type1
v2 = load i64, %p2, !invariant.group !Type1
ret i64 v1 - v2

(Assume that !Type1 is used to describe a write-once integer field within some class; not all instances have the same integer value.)

Having CSE turn this into:
v1 = load i64, %p1, !invariant.group !Type1
p2 = invariant.group.barrier(p1)
if (p1 != p2) return.
store i64 0, %p1, !invariant.group !Type1
v2 = load i64, %p1, !invariant.group !Type1
ret i64 v1 - v2

And then GVN turn this into:
v1 = load i64, %p1, !invariant.group !Type1
p2 = invariant.group.barrier(p1)
if (p1 != p2) return.
ret i64 v1 - v1 (-> 0)

This doesn't seem like the result I'd expect. Is there something about my initial IR which is wrong/invalid in some way? Is an invariant.group required to be specific to a single bit pattern across all usages within a function/module/context? That would be reasonable, but I don't think it is explicitly said right now. It would also make !invariant.group effectively useless for describing fields which are constant per instance rather than per-class.
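(As an illustration of a field that is constant per instance rather than per-class — a hypothetical C++ example, not taken from the proposal:)

  struct S {
    const long id;         // written exactly once, in the constructor,
    S(long id) : id(id) {} // but each instance holds a different value
  };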

Philip

This is because without it the optimization will be useless. For example:

A* a = new A;
a->foo(); // out-of-line virtual call
a->foo();

If we assume that foo calls @llvm.invariant.group.barrier, then we will not
be able to optimize the second call.

Piotr

Why not? If foo calls @llvm.invariant.group.barrier, then it would have to produce a new SSA value to accomplish anything which might affect the second call. Given that the call is on "a", not some return value from foo or a global variable, we know that any SSA value created inside foo isn't relevant. We should end up with two loads of the vtable using the same SSA value and the same invariant.group metadata. The latter can be forwarded from the former without issue, right?

%a = ...
%vtable1 = load %a + Y !invariant.group !0
%foo1 = load %vtable1 + X, !invariant.group !1
call %foo1(%a)
%vtable2 = load %a + Y !invariant.group !0   ; per the stated rules, this value forwards from the previous vtable load
%foo2 = load %vtable2 + X, !invariant.group !1
call %foo2(%a)

Philip

IIUC, given how @llvm.invariant.group.barrier is specified, you don't have
to actually assume that. Since the second load from `a` will be
through the same SSA value, it could not have been piped through
@llvm.invariant.group.barrier.

OTOH, if you have

A *a = new A;
a->foo();
bar(&a);
a->foo();

Then you'll (rightly) have to forgo optimizing the second call to a->foo().
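(For example, bar could legitimately change both the dynamic type of the object and the pointer itself — a hedged sketch, assuming a derived class A2 that fits in A's storage:)

  #include <new>

  struct A { virtual void foo(); virtual ~A(); };
  struct A2 : A { void foo() override; }; // assumed: same size as A, so the storage can be reused

  void bar(A **p) {
    (*p)->~A();       // end the lifetime of the object the caller's 'a' points to
    *p = new (*p) A2; // construct an object of a different dynamic type in the same storage;
  }                   // the caller must now use the updated pointer (a barriered value in IR)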

-- Sanjoy

Right, you got the intention of the design. The foo call can placement-new
into %a if it wants to, but it had better put things back the way they were
before it returns, because %a isn't changing.

We probably missed that sentence when reviewing the proposal. :)

Yes, this family of examples scares me. :) It seems we've discovered a new
device for testing IR soundness. We used it to build a test case that shows
that 'readonly' on arguments without 'nocapture' doesn't let you forward
stores across such a call.

Consider this pseudo-IR and some possible transforms that I would expect to
be semantics preserving:

void f(i32* readonly %a, i32* %b) {
  llvm.assume(%a == %b)
  store i32 42, i32* %b
}
  ...
  %p = alloca i32
  store i32 13, i32* %p
  call f(i32* readonly %p, i32* %p)
  %r = load i32, i32* %p

; Propagate llvm.assume info
void f(i32* readonly %a, i32* %b) {
  store i32 42, i32* %a
}
  ...
  %p = alloca i32
  store i32 13, i32* %p
  call f(i32* readonly %p, i32* %p)
  %r = load i32, i32* %p

; Delete dead args
void f(i32* readonly %a) {
  store i32 42, i32* %a
}
  ...
  %p = alloca i32
  store i32 13, i32* %p
  call f(i32* readonly %p)
  %r = load i32, i32* %p

; Forward store %p to load %p, since the only use of %p is readonly
void f(i32* readonly %a) {
  store i32 42, i32* %a
}
  ...
  %p = alloca i32
  call f(i32* readonly %p)
  %r = i32 13

Today LLVM will not do the final transform because it requires readonly on
the entire function, or nocapture on the argument. nocapture cannot be
inferred due to the assume comparison.

I'd say this first transformation is incorrect. `readonly` is
effectively part of `%a`'s "type" as it constrains and affects the
operations you can do on `%a`. Even if `%b` is bitwise equivalent to
`%a` at runtime, it is "type incompatible" to replace `%a` with `%b`.

This is similar to how you cannot replace `store i32 42, i32
addrspace(1)* %a` with `store i32 42, i32 addrspace(2)* %b`, even if
you can prove `ptrtoint %a` == `ptrtoint %b` -- the nature of `store`
is dependent on the type of the pointer you store through.

The glitch in LLVM IR right now is that the `readonly`ness of `%a` is
not modeled in the type system, when I think it should be. An `i32
readonly*` should be a different type from `i32*`. In practice this
may be non-trivial to get right (for instance `phi`s and `selects`
will either have to do a type merge, or we'd have to have explicit
type operators at the IR level).

-- Sanjoy

From: "Sanjoy Das" <sanjoy@playingwithpointers.com>
To: "Reid Kleckner" <rnk@google.com>
Cc: "Piotr Padlewski" <prazek@google.com>, "cfe-dev@cs.uiuc.edu Developers" <cfe-dev@cs.uiuc.edu>, "LLVM Developers
Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Saturday, August 1, 2015 1:22:50 AM
Subject: Re: [LLVMdev] [cfe-dev] Clang devirtualization proposal

The glitch in LLVM IR right now is that the `readonly`ness of `%a` is
not modeled in the type system, when I think it should be. An `i32
readonly*` should be a different type from `i32*`. In practice this
may be non-trivial to get right (for instance `phi`s and `selects`
will either have to do a type merge, or we'd have to have explicit
type operators at the IR level).

We could do this, but then we'd need to promote these things to first-class parts of the type system (and I'd need to give further thought to how this interacts with dynamically-true properties at call sites and inlining).

The alternative way of looking at it, which is true today, is that @llvm.assume is not removed even when its information is 'used'. It appears, given this example, that this is actually required for correctness, and that dead-argument elimination needs to specifically not ignore effectively-ephemeral values/arguments.

-Hal