Question about optimization of new and delete

Hello.

This is probably an old question; sorry if so.

I was wondering which optimizations clang/llvm does (or can do, given restrictions from the standard)
about the use of default new/delete operators.

In particular, I'm wondering about a couple of interesting cases:

1) Can clang convert a call to the default operator new into a stack allocation, if it can prove that
the lifetime of the allocated memory matches the enclosing scope?
Would that be in some way contrary to the standard, or otherwise undesirable?

Trivial example:

void func() {
    Class *obj = new Class(args);
    obj->stuff();
    delete obj;
}

Compiling something simple like this, I see that the pair of operator calls is elided, but in more complex cases in
my code this does not happen, and I would like to know why (is it the effect of a tradeoff with the size of the allocated memory?)

2) Can clang/llvm identify as a dead store a write to a piece of memory that is about to be deleted?

void func(Class *obj) {
    obj->member = 42; // Is this store removed?
    delete obj;
}

Compiling this exact code with -O3, I see the store still there.

Thank you very much.

Nicola

Hello.

This is probably an old question; sorry if so.

Heh, not really. We're just now getting the C++ standard fixed:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3664.html

I was wondering which optimizations clang/llvm does (or can do, given
restrictions from the standard)
about the use of default new/delete operators.

See the paper for the high level theory of what the standard allows (once
it goes in). The current plan in Clang is to follow that paper's strategy
of optimization even in older standard modes unless there is significant
existing code broken by it.

In particular, I'm wondering about a couple of interesting cases:

1) Can clang convert a call to the default operator new into a stack allocation,
if it can prove that
the lifetime of the allocated memory matches the enclosing scope?
Would that be in some way contrary to the standard, or otherwise undesirable?

We can do this, but it may cause an unacceptable growth in stack usage. Hal
Finkel has recently started work on an optimization pass to do precisely
this, and I'm helping review and get it into the tree. I would expect Clang
to start doing this soon for small, constant size allocations.

Currently, Clang only really does this when it can delete the allocation
entirely, which is probably why you see confusing results from your testing.
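
To make that concrete, here is a purely illustrative, source-level sketch of what the promotion amounts to; "Widget" is just a stand-in for a small class of known, constant size, not anything from the actual pass:

struct Widget {
    int value;
    explicit Widget(int v) : value(v) {}
    void stuff() {}
};

void func_before() {
    Widget *obj = new Widget(42);   // heap allocation paired with the delete below
    obj->stuff();
    delete obj;
}

void func_after() {     // what the optimizer is, in effect, allowed to produce
    Widget obj(42);     // storage comes from the stack frame instead of operator new
    obj.stuff();
}                       // destructor runs here; no operator new/delete calls remain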

2) Can clang/llvm identify as a dead store a write to a piece of memory that is about to be deleted?

void func(Class *obj) {
    obj->member = 42; // Is this store removed?
    delete obj;
}

Compiling this exact code with -O3, I see the store still there.

Huh. No, I bet we miss this. This is almost certainly just a missed
optimization that we should get.

However, note that all of Clang's optimizations of this form are mostly
conducted by the LLVM optimization libraries Clang is built on, and so the
fix to this missed optimization will likely involve mostly a change to LLVM.

Uh, ok :)

What could be broken? I think such source code might be quite perverse…

This applies to most common use cases, right? Most objects are small…
Anyway, what will the criterion be for deciding which allocations are lowered to the stack and which are not (e.g. what does "small" mean)?

Yeah, of course. But what I don't know is how LLVM knows, at the IR level, that a call to _Znam allocates memory and is paired with a call to _Zdam that frees it. Is this special knowledge about these symbols, or some attribute attached to the external symbol declaration?
For example, for this dead store, can you model in the IR that the memory will be lost, or do you need a special case in the dead-store-elimination pass? Anyway, is this C++-specific, or is there some generic architecture to model dynamic allocation (I believe not)?

I ask this because in another project I'm writing an LLVM-based compiler, and I'll want to apply those kinds of optimizations as well.

Maybe it would be worth using some intrinsics, lowered by a front-end-specific pass, to model dynamic allocation primitives.
What do you think?

Thanks,
Nicola

I've been fighting similar patterns recently. The IR looks like this:

define void @_Z4funcP5Class(%struct.Class* %obj) #0 {
entry:
  %member = getelementptr inbounds %struct.Class* %obj, i64 0, i32 0
  store i32 42, i32* %member, align 4, !tbaa !0
  %isnull = icmp eq %struct.Class* %obj, null
  br i1 %isnull, label %delete.end, label %delete.notnull

delete.notnull: ; preds = %entry
  %0 = bitcast %struct.Class* %obj to i8*
  tail call void @_ZdlPv(i8* %0) #2
  br label %delete.end

delete.end: ; preds = %delete.notnull, %entry
  ret void
}

The superfluous NULL check here is preventing the elimination of the store. In my opinion we should optimize this based on the assumption that a pointer can't be NULL after a load/store to it. GCC does this and got some flak because it turned a crash into a security problem in the Linux kernel (-fno-delete-null-pointer-checks was introduced as a remedy). I'm not sure how to best implement that with our current VRP infrastructure.
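
For what it's worth, the source-level shape of that IR is roughly the following (illustrative only; Class here is a minimal stand-in for the class in the example):

struct Class { int member; };    // minimal stand-in

void func_expanded(Class *obj) {
    obj->member = 42;            // dereference: obj cannot be null past this point
    if (obj != nullptr)          // the superfluous null check emitted for 'delete obj'
        ::operator delete(obj);  // deallocation; makes the store above dead
}

Once the check is folded away using the non-null fact established by the store, the store writes to memory that is immediately freed and can be eliminated.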

- Ben

Anyway, is this C++-specific, or is there some generic architecture to model dynamic allocation (I believe not)?

Most of this is based in lib/Analysis/MemoryBuiltins.cpp, which basically splits things into "prototypical types of behaviour" for functions that LLVM knows how to do useful transformations with, and some plumbing to connect the function names (or mangled function names in C++) to the appropriate behaviour.

Cheers,
Dave


Heh, not really. We're just now getting the C++ standard fixed:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3664.html

Uh, ok :)

See the paper for the high level theory of what the standard allows
(once it goes in). The current plan in Clang is to follow that
paper's strategy of optimization even in older standard modes unless
there is significant existing code broken by it.

What could be broken? I think such source code might be quite
perverse..

We can do this, but it may cause an unacceptable growth in stack
usage. Hal Finkel has recently started work on an optimization pass
to do precisely this, and I'm helping review and get it into the
tree. I would expect Clang to start doing this soon for small,
constant size allocations.

This applies to most common use cases, right? Most objects are
small…
Anyway, what will the criterion be for deciding which
allocations are lowered to the stack and which are not (e.g. what does
"small" mean)?

The criteria have not yet been decided, and I think some experimentation will be necessary. In the current implementation, the conversion happens for all allocations provably less than 1024 bytes. However, we may want to cap the total size of converted mallocs in addition to, or instead of, the individual sizes. Maybe this cap should depend on how much stack memory is already being requested by the function. I'm certainly open to suggestion.
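
As a purely illustrative sketch of that kind of policy (the helper name and the 4096-byte total below are hypothetical, not what the pass actually uses):

#include <cstdint>

bool shouldPromoteToStack(std::uint64_t allocSize, std::uint64_t alreadyPromoted) {
    const std::uint64_t perAllocationLimit = 1024;  // "provably less than 1024 bytes"
    const std::uint64_t totalPromotedLimit = 4096;  // hypothetical cap on the sum of conversions
    return allocSize < perAllocationLimit &&
           alreadyPromoted + allocSize <= totalPromotedLimit;
}

A variant could also fold the function's existing frame size into the second check, as suggested above.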

2) Can clang/llvm identify as a dead store a write to a piece of
memory that is about to be deleted?

void func(Class *obj) {
    obj->member = 42; // Is this store removed?
    delete obj;
}

Compiling this exact code with -O3, I see the store still there.

Huh. No, I bet we miss this. This is almost certainly just a missed
optimization that we should get.

Is there a call to the destructor after the store? If so, then we might not know that the destructor does not access the stored value.
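
For example, a (hypothetical) destructor like the one below would be enough to keep the store alive, because it reads the member before the memory is released:

#include <cstdio>

struct Logged {
    int member;
    ~Logged() { std::printf("%d\n", member); }   // observes the stored value
};

void func(Logged *obj) {
    obj->member = 42;   // not a dead store: ~Logged() reads it before deallocation
    delete obj;
}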

However, note that all of Clang's optimizations of this form are
mostly conducted by the LLVM optimization libraries Clang is built
on, and so the fix to this missed optimization will likely involve
mostly a change to LLVM.

Yeah, of course. But what I don't know is how LLVM knows, at the
IR level, that a call to _Znam allocates memory and is paired with a
call to _Zdam that frees it. Is this special knowledge about these
symbols, or some attribute attached to the external symbol
declaration?

Yes, LLVM has special knowledge of certain external symbols. In this case, the relevant code is in: lib/Analysis/MemoryBuiltins.cpp (in LLVM). Although perhaps not obvious from the code, the symbol recognition depends on both the name, and the presence of a special 'builtin' attribute. Clang will add this special attribute in cases where the compiler is free to make assumptions about the semantics of the call.
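
As a rough sketch of how a pass consults that analysis instead of matching names itself (the exact headers and signatures have moved around between LLVM versions, so treat this as illustrative rather than the current API):

#include "llvm/Analysis/MemoryBuiltins.h"
#include "llvm/Analysis/TargetLibraryInfo.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Returns true if the instruction is recognized as an allocation or a
// deallocation call; recognition relies on the symbol name plus the
// attributes described above, all handled inside MemoryBuiltins.
static bool touchesHeap(const Instruction &I, const TargetLibraryInfo *TLI) {
    if (isAllocationFn(&I, TLI))
        return true;    // behaves like malloc / operator new
    if (isFreeCall(&I, TLI))
        return true;    // behaves like free / operator delete
    return false;
}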

For example, for this dead store, can you model in the IR that the
memory will be lost, or do you need a special case in the
dead-store-elimination pass? Anyway, is this C++-specific, or is there
some generic architecture to model dynamic allocation (I believe not)?

It is not C++ specific, although there may be some C++-specific knowledge in some of the function categories.

I ask this because in another project I'm writing an LLVM-based
compiler, and I'll want to apply those kinds of optimizations as well.

Maybe it would be worth using some intrinsics, lowered by a
front-end-specific pass, to model dynamic allocation primitives.
What do you think?

We may end up with something like that at some point, but for now the builtin symbol recognition has been sufficient. FWIW, we also do this same thing with other 'libc' functions, see for example lib/Transforms/Utils/SimplifyLibCalls.cpp, and some functions get special CodeGen support as well.
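
For a flavor of what those libcall simplifications do, both of the calls below are typically rewritten by the optimizer once they are recognized as the standard functions (illustrative examples, not an exhaustive list):

#include <cstdio>
#include <cstring>

std::size_t known_length() {
    return std::strlen("hello");   // typically folded to the constant 5
}

void greet() {
    std::printf("hello\n");        // typically rewritten to puts("hello")
}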

-Hal

> Hello.
>
> This is probably an old question; sorry if so.
>
> I was wondering which optimizations clang/llvm does (or can do,
> given restrictions from the standard)
> about the use of default new/delete operators.
>
> In particular, I'm wondering about a couple of interesting cases:
>
> 1) Can clang convert a call to the default operator new into a stack
> allocation, if it can prove that
> the lifetime of the allocated memory matches the enclosing
> scope?
> Would that be in some way contrary to the standard, or otherwise
> undesirable?
>
> Trivial example:
>
> void func() {
>     Class *obj = new Class(args);
>     obj->stuff();
>     delete obj;
> }
>
> Compiling something simple like this, I see that the pair of
> operator calls is elided, but in more complex cases in
> my code this does not happen, and I would like to know why (is it
> the effect of a tradeoff with the size of the allocated memory?)
>
> 2) Can clang/llvm identify as a dead store a write to a piece of
> memory that is about to be deleted?
>
> void func(Class *obj) {
>     obj->member = 42; // Is this store removed?
>     delete obj;
> }
>
> Compiling this exact code with -O3, I see the store still there.

I've been fighting similar patterns recently. The IR looks like this:

define void @_Z4funcP5Class(%struct.Class* %obj) #0 {
entry:
  %member = getelementptr inbounds %struct.Class* %obj, i64 0, i32 0
  store i32 42, i32* %member, align 4, !tbaa !0
  %isnull = icmp eq %struct.Class* %obj, null
  br i1 %isnull, label %delete.end, label %delete.notnull

delete.notnull: ; preds = %entry
  %0 = bitcast %struct.Class* %obj to i8*
  tail call void @_ZdlPv(i8* %0) #2
  br label %delete.end

delete.end: ; preds = %delete.notnull, %entry
  ret void
}

The superfluous NULL check here is preventing the elimination of the
store. In my opinion we should optimize this based on the assumption
that a pointer can't be NULL after a load/store to it. GCC does this
and got some flak because it turned a crash into a security problem
in the Linux kernel (-fno-delete-null-pointer-checks was introduced
as a remedy). I'm not sure how to best implement that with our
current VRP infrastructure.

Interesting. Do you mean that the kernel will not internally fault if it stores to some place near 0x0? Or was the offset so large that it was going someplace valid?

In any case, I guess that we should probably do the same.

-Hal

The criteria have not yet been decided, and I think some experimentation will be necessary. In the current implementation, the conversion happens for all allocations provably less than 1024 bytes. However, we may want to cap the total size of converted mallocs in addition to, or instead of, the individual sizes. Maybe this cap should depend on how much stack memory is already being requested by the function. I'm certainly open to suggestion.

Capping the total size seems more reasonable to me, but as you said, it's necessary to experiment.
Anyway, I was thinking about a tradeoff between stack usage and speed based on the "hotness" of the code (is that the right term?). For example, a
relatively big allocation inside a tight loop might be lowered, while an allocation of the same size at function scope would not.
Also, I think whether the function is recursive, and whether it gets transformed by tail-call elimination, should be factors to take into account.

What do you think?

Is there a call to the destructor after the store? If so, then we might not know that the destructor does not access the stored value.

No, there isn't. The issue shows up even with a plain class containing only that member.

Thanks,
Nicola

>
> The criteria have not yet been decided, and I think some
> experimentation will be necessary. In the current implementation,
> the conversion happens for all allocations provably less than 1024
> bytes. However, we may want to cap the total size of converted
> mallocs in addition to, or instead of, the individual sizes. Maybe
> this cap should depend on how much stack memory is already being
> requested by the function. I'm certainly open to suggestion.
>

Capping the total size seems more reasonable to me, but as you said,
it's necessary to experiment.
Anyway, I was thinking about a tradeoff between stack usage and speed
based on the "hotness" of the code (is that the right term?). For example, a
relatively big allocation inside a tight loop might be lowered, while
an allocation of the same size at function scope would not.
Also, I think whether the function is recursive, and whether it gets
transformed by tail-call elimination, should be factors to take into
account.

What do you think?

These all seem like good ideas; once the initial support lands upstream, please play with it. I think that user feedback is really going to be the only way to set these various thresholds.

-Hal

These all seem like good ideas; once the initial support lands upstream, please play with it. I think that user feedback is really going to be the only way to set these various thresholds.

Of course I'll play with it! Please let us know on this thread when the commit comes out.


Thank you,
Nicola

Hal Finkel wrote:

Capping the total size seems more reasonable to me, but as you said,
it's necessary to experiment.
Anyway, I was thinking about a tradeoff between stack usage and speed
based on the "hotness" of the code (is that the right term?). For example, a
relatively big allocation inside a tight loop might be lowered, while
an allocation of the same size at function scope would not.
Also, I think whether the function is recursive, and whether it gets
transformed by tail-call elimination, should be factors to take into
account.

What do you think?

These all seem like good ideas; once the initial support lands upstream, please play with it. I think that user feedback is really going to be the only way to set these various thresholds.

Stack space constraints can vary so greatly in practice that it might be a good idea to make these thresholds user-adjustable through compiler switches, pragmas or function attributes.

The default thresholds for optimizing allocations in non-leaf functions should probably be very low, as you otherwise risk blowing up programs such as deeply recursive parsers or network servers with a million lightweight threads.

To monitor the effects of such an optimization it would be nice if the compiler could optionally output some information on the minimum and maximum stack usage of functions.

I also suspect that the effectiveness of this optimization will depend a lot on whether functions can be opportunistically inlined for it.
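
For example (hypothetical names), an allocation hidden behind a small helper only becomes a candidate once the helper is inlined into a caller where the matching delete is visible:

struct Node {
    int value = 0;
    void touch() { ++value; }
};

static Node *make() { return new Node(); }

void use() {
    Node *p = make();   // after inlining: new and delete are paired in one function
    p->touch();
    delete p;
}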

- Stephan