RFC: We should stop merging allocas in the inliner

Chris added alloca merging in the inliner a looooong time ago, 2009. The reason he added it was because at the time we didn’t do stack coloring and without it we had serious stack size problems in LLVM.

Since then, a few rather important things have changed:

  • We have reasonably powerful stack coloring support in the backend based on lifetime markers
  • Clang (the primary frontend I’m aware of that hit issues with this) is now emitting lifetime markers for essentially everything
  • Clang even has front-end based -O0 stack reuse tricks
  • AliasAnalysis has become even more important (extended to codegen time etc) and relies heavily on distinguishing between allocas.
  • We have lots of tools like the sanitizers that directly reason about allocas to catch bugs in user programs.

The first two changes pretty much completely obviate the need for alloca merging as far as I’m aware, and the last two changes make the information loss caused by this practice, IMO, really bad.

Even if there are problems exposed by turning off alloca merging in the inliner, they are almost certainly deficiencies in stack coloring, Clang, or other parts of the optimizer that we should fix rather than continue to ignore.
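To make the stack-coloring point concrete, here is a tiny illustrative model (Python; the greedy interval-sharing scheme and all names are invented for illustration, not LLVM's actual StackColoring implementation) of why lifetime markers make inliner merging redundant: allocas whose lifetimes don't overlap can share a stack slot anyway.

```python
def color_stack(allocas):
    """allocas: list of (size, start, end) lifetime intervals.
    Returns total frame size after greedy slot sharing."""
    slots = []  # each slot: [max_size, [(start, end), ...]]
    for size, start, end in sorted(allocas, key=lambda a: a[1]):
        for slot in slots:
            # Reuse a slot only if this lifetime overlaps none already in it.
            if all(end <= s or start >= e for s, e in slot[1]):
                slot[0] = max(slot[0], size)
                slot[1].append((start, end))
                break
        else:
            slots.append([size, [(start, end)]])
    return sum(slot[0] for slot in slots)

# Two inlined callees, each with a 64-byte buffer, live at disjoint
# points of the caller: coloring overlaps them into one slot.
print(color_stack([(64, 0, 10), (64, 20, 30)]))  # 64, not 128
# Overlapping lifetimes still get distinct storage:
print(color_stack([(64, 0, 10), (64, 5, 30)]))   # 128
```

This is the same saving the inliner's merging used to provide, but driven by lifetime information rather than by fusing the allocas (and so without destroying the distinction between them).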

Thoughts? The code changes are easy and mechanical. My plan would be:

  1. Add a flag to turn this off.
  2. Run benchmarks with flag to make sure things are actually working.
  3. Change default for the flag, make sure chaos doesn’t ensue.
  4. Delete the (quite substantial) complexity associated with this and the flag.
  5. Profit!

-Chandler


Thoughts? The code changes are easy and mechanical. My plan would be:

There is one caveat: stopping alloca merging in the inliner will change the
inliner's cost analysis, which may affect inlining decisions and
performance in unexpected ways. For instance, the allocatedSize estimate
won't be as accurate, due to double counting.
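A sketch of the double-counting concern (purely illustrative numbers; `allocatedSize` here refers loosely to the cost analysis' bookkeeping, not an exact API): if two inlined callees each bring an array that coloring will later overlap, summing their sizes overstates the real frame growth.

```python
# Hypothetical sketch: the cost analysis sums the allocas it sees,
# but stack coloring will overlap the disjoint ones, so the real
# frame growth is closer to the max than to the sum.
inlined_alloca_sizes = [128, 128]   # two inlined callees, one array each

estimate_without_merging = sum(inlined_alloca_sizes)  # 256: double counts
actual_after_coloring    = max(inlined_alloca_sizes)  # 128: slots overlap

print(estimate_without_merging, actual_after_coloring)  # 256 128
```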

1) Add a flag to turn this off.
2) Run benchmarks with flag to make sure things are actually working.
3) Change default for the flag, make sure chaos doesn't ensue.
4) Delete the (quite substantial) complexity associated with this and the
flag.

The code handling the merging seems to be in an isolated place with <100
lines of code. While I agree this change may be something good to do in
principle -- is there any practical reason for doing it? You mentioned
information loss (aliasing, sanitizers, etc.) -- are there any real issues
caused by the merging (with tracking bugs)?

David

Would we want to wrap any of the allocas we do import while inlining in extra lifetime markers? (not sure if nested lifetime markers are supported/meaningful - maybe just in the case where there are no lifetime markers already?) So that non-Clang IR clients still get stack reuse for inlined stack slots?

I want to double check, but I believe it already does exactly that. If not, yes, we should synthesize lifetime markers just to be safe.

The existing lifetime start/end is not very well defined (by spec, or by
code) in what it means, so you could have nested lifetime markers if you
wanted. If you made the spec well-defined, they would be meaningful (for
loops, etc).

There are a number of open bugs/complaints about lifetime markers and the
fact that the scope is not well defined (the spec says "This intrinsic
indicates that before this point in the code, the value of the memory
pointed to by ptr is dead. This means that it is known to never be used and
has an undefined value. A load from the pointer that precedes this
intrinsic can be replaced with 'undef'.)

"Before" is not a well-defined term :) (the end code says after, which has
similar problems)

Trivial case: a two-block loop, with a lifetime marker at the beginning:

    A <----+
    |      |
    v      |
    B -----+

The lifetime marker is at the top of A.

Is B before A? (if not, are we using dominates as the definition of
"before", because B is a predecessor of A)
Does the answer change if i put a jump to B somewhere above A?
If we use dominates, and i put a jump there, i can make A and B dominator
tree siblings. Is the pointer now invalid? :)

etc

I suspect what is really wanted is "i want lifetime markers to specify that
this thing has single-iteration scope, function scope, or SESE/SEME region
scope" (or something).


Also, having a lifetime.start marker seems unnecessary, and it actually
makes live range analysis less precise. One local var can have a single
end marker and multiple implicit start markers (its first uses). The end
marker should post-dominate all uses. Having one start marker is like
defining a least common dominator for all the implicit starts, which makes
the resulting live range strictly 'larger' than necessary.
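A tiny illustration of this point (hand-built diamond CFG; the block names and set-based accounting are invented for illustration): if a variable's first uses are in the two branch blocks, a single lifetime.start must sit at their common dominator, so the modeled live range gains a block the variable never actually occupies.

```python
# Diamond CFG: entry -> {left, right} -> exit.
# The variable is first used in 'left' and in 'right', and its single
# end marker sits in 'exit' (post-dominating all uses).
first_uses = {"left", "right"}
end_block = "exit"

# Implicit per-path starts: the range covers only the blocks from each
# first use to the end marker.
range_implicit = first_uses | {end_block}

# A single explicit lifetime.start must dominate *both* first uses, so
# it lands in 'entry', the least common dominator.
common_dominator = "entry"
range_single_start = {common_dominator} | first_uses | {end_block}

print(len(range_implicit), len(range_single_start))  # 3 4
```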

David


Also, having a lifetime.start marker seems unnecessary, and it actually
makes live range analysis less precise. One local var can have a single
end marker and multiple implicit start markers (its first uses).

The original goal, AFAIK, of lifetime start/end was to say "the frontend
put the alloca/etc here, but that's not really where the lifetime starts,
so please ignore anything between the thing you think is the start and the
real start".

Extend this to structures, where parts of a structure may get used but
other parts still dead for a while.

The end marker should post-dominate all uses. Having one start marker is
like defining a least common dominator for all the implicit starts, which
makes the resulting live range strictly 'larger' than necessary.

Yes, it is possible to track and compute more precise ranges than the
lifetime start markers give you, *assuming* the actual placement of
variables in the IR is accurate to the original block scope (IE nothing
generates uninitialized loads to start), and you compute the ranges before
anything messes that up (IE moves the uses up) :).

Right now, i believe the lifetime start markers are what block things from
moving those loads up, because they just get deleted as undefined :)

In any case, i think this is going to derail chandler's question if we
continue on this thread and not a new one :)

--Dan

Thanks. ;] I really don’t want to deep dive into ways that lifetime markers are broken or could be improved unless they are so broken that they won’t suffice to replace the merging done by the inliner.

AFAICT, the merging done by the inliner is quite weak and the lifetime markers are already going to be more powerful (and in fact, we turn the merging off in a pile of cases and rely pretty much exclusively on the lifetime markers there).

I just want to note that there's currently a miscompile in C code with lifetime markers:
  http://lists.llvm.org/pipermail/cfe-dev/2016-July/050066.html

Depending on how we fix that, we could end up being more pessimistic about when markers are emitted. I suggest we fix that before benchmarking this change.

Yeah, this is a reverse variant of the example i posted to david blaikie :)

Assuming y’all are certain this is supposed to do something sensible in C, outside of “stop emitting lifetime markers in functions with jumps past variable initialization”, getting a conservatively right answer is … non-trivial AFAIK.

That shouldn’t impact this change.

The inliner will emit inlining based lifetime markers all on its own for any static allocas that the frontend doesn’t annotate. So we could turn off all lifetime markers in Clang and the merging here should still not be necessary to get the stack savings.

(And I have now verified that yes, the inliner inserts these markers…)

Ah, thanks, I’d misunderstood up-thread somehow. In that case I agree that it’s independent.


Thoughts? The code changes are easy and mechanical. My plan would be:

I like the idea in principle.

1) Add a flag to turn this off.
2) Run benchmarks with flag to make sure things are actually working.
3) Change default for the flag, make sure chaos doesn't ensue.

What does this imply? Just run-time? I’d like to see stack size data w/ and w/o the change and - just in case - an investigation/fix of the outliers. Any thoughts about compile-time and code size?

You probably also want to collect the data across more than one platform.

4) Delete the (quite substantial) complexity associated with this and the flag.

You might need to allow target specific behavior for some grace period.


  1. Add a flag to turn this off.
  2. Run benchmarks with flag to make sure things are actually working.
  3. Change default for the flag, make sure chaos doesn’t ensue.

What does this imply? Just run-time? I’d like to see stack size data w/ and w/o the change and - just in case - an investigation/fix of the outliers.

Sure, I’ll do both some benchmarks and also look at some stack sizes.

Any thoughts about compile-time and code size?

I think compile-time regressions are certainly possible with heavier use of AA and stack coloring, but we’d likely be hitting those through other patterns than just this one already. It doesn’t seem likely this will really cause much churn there, but certainly if reports come in, we will have a flag that we can toggle until things are fixed.

You probably also want to collect the data across more than one platform.

I’ll certainly collect it for the platforms I have access to and such, but hopefully those interested in other platforms and worried about stack space can also do some testing. That’s why I want to add a flag so it’s super easy to check whether this regresses something you care about.

  4. Delete the (quite substantial) complexity associated with this and the flag.
    You might need to allow target specific behavior for some grace period.

I really don’t think this warrants a long grace period much less target specific behavior.

The goal is simplification. We have lots of infrastructure for a target to control how stack coloring and layout is done. Honestly, I suspect most of the stack objects already go down this path. It is only array type allocas that we even bother with this merging for right now, so I’m not expecting any dramatic changes.

Clearly, if folks report problems, we’ll turn it back on and have to sort out what happened, but I think we can take a pretty simple approach to get from here to there…

Seeing no big concerns raised with this direction…

  1. Add a flag to turn this off.

https://reviews.llvm.org/D23052

Will proceed from there.

I don’t think I have seen an answer to the first email from David in this thread?

Sorry, I missed some of the comments when I got distracted by the tangent Danny and David went down, will respond to the rest shortly.

Sorry I missed these comments in my first read-through, David.

There is the possibility of this, but I think it is relatively unlikely for a few reasons.

The first is that I don’t think the alloca merging is helping that much here. Specifically, we only merge certain kinds of allocas and we only merge them in fairly limited cases. So if the thresholds were very sensitive to this, I would expect us to have seen these thresholds blocking things more often than we have.

Also, as you mention, there are two places we might hit this. One is due to the increased number of alloca instructions causing skew in the inline cost. I think we should “fix” this by stopping counting static alloca instructions for the purpose of inline cost estimation. I think this will be a net win. The alloca instructions are completely synthetic when static. All of them will be folded into a single stack adjustment that is even shared with register spills etc. I think we should model them as free. (Clearly, dynamic alloca instructions are different here!) But I don’t think this is likely enough to be urgent to fix. I think we could do both changes in parallel.
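A sketch of what "stop counting static allocas" could look like in a toy cost model (the instruction stream and cost function here are invented for illustration, not LLVM's real InlineCost logic): static allocas fold into a single prologue stack adjustment, so they are treated as free, while dynamic (runtime-sized) allocas still pay.

```python
# Toy inline-cost model over a pretend instruction stream. Each
# instruction is (opcode, is_static). Static allocas become part of
# one stack adjustment in the prologue, so we charge nothing for
# them; dynamic allocas still cost.
INSTR_COST = 1

def inline_cost(instructions):
    cost = 0
    for opcode, is_static in instructions:
        if opcode == "alloca" and is_static:
            continue  # folded into the prologue: model as free
        cost += INSTR_COST
    return cost

callee = [
    ("alloca", True),   # static: free
    ("alloca", True),   # static: free
    ("alloca", False),  # dynamic: counted
    ("add", False),
    ("ret", False),
]
print(inline_cost(callee))  # 3
```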

The other limit is the allocated size limit, and I think that is the more likely issue (it sounds like it was the one you were worried about as well). This one might be an issue, but I also think we can adjust it pretty freely. The goal of this limit, from my recollection of when it went in, is more to avoid run-away stack bloating, and the value of the limit is somewhat arbitrary. And it only impacts inlining into recursive functions. So I’m less worried about this in general. If it proves to be a problem in practice, we can do something about it, and if that happens it would probably be good to know because we shouldn’t be relying on merging to model this anyways.

Yes, it’s complexity which is better addressed by another layer. Every piece of complexity, especially in something like the inliner, makes it harder to change and reason about. I’m trying to minimize and isolate the complex parts of the inliner in preparation for porting it to the new pass manager. This seemed like low-hanging fruit.

I don’t have any, but I also don’t think we should wait for bugs to come up to make principled changes to the compiler, especially those that are simplifying!


Also, as you mention, there are two places we might hit this. One is due
to the increased number of alloca instructions causing skew in the inline
cost. I think we should "fix" this by stopping counting *static* alloca
instructions for the purpose of inline cost estimation. I think this will
be a net win. The alloca instructions are completely synthetic when static.
All of them will be folded into a single stack adjustment that is even
shared with register spills etc. I think we should model them as free.
(Clearly, *dynamic* alloca instructions are different here!) But I don't
think this is likely enough to be urgent to fix. I think we could do both
changes in parallel.

Yes, I agree with this. In fact, not only should we not model static
allocas as 'cost'; if we have not already modeled them as part of the call
overhead (stack pointer adjustment in the prologue), we should probably
also model all static allocas collectively in the callee as potential
savings after inlining. If you don't have time to get to it (at least the
first part), Easwaran can probably help with this adjustment.

The other limit is the allocated size limit, and I think that is the more
likely issue (it sounds like it was the one you were worried about as
well). This one might be an issue, but I also think we can adjust it pretty
freely. The goal of this limit, from my recollection of when it went in, is
more to avoid run-away stack bloating, and the value of the limit is
somewhat arbitrary. And it only impacts inlining into recursive functions.
So I'm less worried about this in general. If it proves to be a problem in
practice, we can do something about it, and if that happens it would
probably be good to know because we shouldn't be relying on merging to
model this anyways.

The current limit (for recursive callers) is indeed arbitrary, but I am
more worried about a more general case which applies to any deep call
chain: we may want to implement a heuristic to throttle inlining in
general due to excessive stack frame growth. For this reason, the tracking
of stack size needs to match actual stack usage closely -- otherwise we
end up either missing inlinings or hitting runtime errors (due to
out-of-stack problems).

One way to solve this without requiring merging is to track the current
node's frame size as the caller's original stack size plus the max (not
the sum) of its callees' frame sizes.
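That suggestion can be sketched as a recursion over the call graph (hypothetical graph and sizes; this assumes sibling callees' frames never coexist, which is why max rather than sum is the right combiner).

```python
# frame(f) = own_size(f) + max(frame(c) for callees c), since sibling
# callees occupy the same stack region at different times. Summing
# them instead would overestimate peak stack usage.
own_size = {"main": 32, "a": 64, "b": 128, "leaf": 16}
callees = {"main": ["a", "b"], "a": ["leaf"], "b": [], "leaf": []}

def frame_size(fn):
    child = max((frame_size(c) for c in callees[fn]), default=0)
    return own_size[fn] + child

print(frame_size("main"))  # 32 + max(64 + 16, 128) = 160
```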


Thanks. I just wanted to understand the motivation for the change, which
you provided.

David

Sure, if Easwaran can look at that independently, it’d be great.

So there are two ways I can see the threshold being used.

One is to prevent completely nuts run-away stack growth. That use case can be served with a fairly arbitrary approximation of stack growth and an arbitrary threshold.

But the other use case is, as you say, to be reasonably careful about inlining “too much” and hurting stack size. There, I agree with you that we need a fairly precise way to model this and I don’t think LLVM has one at the moment.

My belief is that at best we are currently addressing the first of these goals, and that the difference between merging or not merging could be easily accommodated by changes to the threshold.

I’m happy for folks to look at going after the second goal, but the merging thing isn’t going to be nearly enough so I don’t think that should really hold up this move.