RFC: Switching to the new pass manager by default

Greetings everyone!

The new pass manager is getting extremely close to the point where I’m not aware of any significant outstanding work needed, and I’d like to see what else would be needed to enable it by default. Here are the current functionality I’m aware of outstanding:

  1. Does not do non-trivial loop unswitching. Majority of this is in https://reviews.llvm.org/D34200 but will need one or two small follow-ups.

  2. Currently, sanitizers don’t work correctly with it. Thanks to the work of others, the missing infrastructure has been added and I’ll send a patch to wire this up this week.

  3. Missing support for ‘optnone’. I’ve been working on this, but the existing testing wasn’t as thorough as I wanted, so it is going slowly. I’ve got about 1/4 of this implemented and should have patches this week or next.

  4. Missing opt-bisect (or similar) facility. This looks pretty trivial to add, but I’ve not even started. If anyone is interested in it, go for it. We might even be able to do something simpler using the generic debug counters and get equivalent functionality.

… that’s it?

Optimization quality / run-time performance:

  • We’ve been using it at Google extensively and are very happy with the optimization quality. Benchmarks look very good here.
  • More data from other users would be important.
  • You can try it out with -fexperimental-new-pass-manager to Clang

Compile-time performance:

  • Sometimes much better due to cached analyses.
  • Sometimes worse, typically due to more / different inlining in turn running main pipeline (GVN + InstCombine) more times or over more code.
  • Overall somewhat a wash, but the increased compile times typically due to the optimizer “trying” harder, so not too concerning on our end.
  • Again, more feedback from other users good: -fexperimental-new-pass-manager to Clang

Once the four missing things land, I’ll also happily work on collecting some of the basics on the test-suite and CTMark. But I suspect more “in the wild” data would really be useful here given the significance of the change.

Thoughts? What else (beyond the four items above and feedback on run-time and compile-time) would folks like to see?

Once this happens, I’ll also be preparing some batch, mechanical updates to the test suite to primarily use the new pass manager. Also there is lots of documentation updates that will be needed here.

-Chandler

PS: I’ll be sending a note to cfe-dev as a “heads up” about this discussion as in some ways, the default flip is mostly a Clang default flip. But hopefully our doc updates will trigger this being “perceived” as the default for other frontends, and I’ll try to reach out to other major frontends as well (Swift and Rust are on my radar, and I’ve already started talking with Philip Reames about their Falcon JIT).

Hi Chandler,

I don’t track the head revisions, instead I do a “Big Bang” update with each 6-month formal release of the LLVM suite, and I am now at v5.0.

Is the implementation of the new pass manager in LLVM v5.0 suitable for me to usefully experiment with it so as to provide feedback, or should I wait and try it out with v6.0?

Thanks,

MartinO

Hi Chandler,

I don’t track the head revisions, instead I do a “Big Bang” update with each 6-month formal release of the LLVM suite, and I am now at v5.0.

Is the implementation of the new pass manager in LLVM v5.0 suitable for me to usefully experiment with it so as to provide feedback, or should I wait and try it out with v6.0?

Unfortunately, some very important fixes landed after 5.0, and the risk was too high to cherry-pick them to fix an off-by-default feature. LLVM since the start of August is in a pretty stable state for testing this stuff.

-Chandler

Thanks Chandler, I guess I’ll have to wait a few more months so :wink:

MartinO

I'd say the complete lack of documentation available from the http://llvm.org/docs landing page is something that should be remedied before the default gets flipped, especially if it impacts running passes via opt.

On a related note, what's the expected timeline for removing the old pass manager?

Could you share some details/numbers?

Michael

I just tried putting a relatively recent version of our merge branch (r316139 + our local changes) through a small subset of the PS4 testing with -fexperimental-new-pass-manager enabled. In general, most tests passed :). There are a few failures I’d need to look at in more detail though. The most glaring are all our tests related to generating coverage info:

$ cat 1.cpp
void foo(){}
$ ./build/bin/clang -c -fprofile-instr-generate -fcoverage-mapping 1.cpp
$ ./build/bin/clang -c -fprofile-instr-generate -fcoverage-mapping 1.cpp -fexperimental-new-pass-manager
instrprof failed to lower an increment
UNREACHABLE executed at /home/greg/public_git/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:5893!

which I’m assuming is the same thing as was reported in https://bugs.llvm.org/show_bug.cgi?id=33773 . This would definitely be a prerequisite for us to consider flipping the default.

There are a few tests looking for specific codegen that are seeing something different. At a glance, it looks like the output from clang’s -O2 looks roughly similar in some small examples with and without the new PM, but clang’s -O1 is producing very different output. Whether this is an issue, I can’t say for sure (I’ve not tried benchmarking the -O1s against each other yet). Is this expected?

I’ve also got an example of a specific file that shows quite a large compile time increase with the new PM:

Old PM:

—User Time— --System Time-- --User+System-- —Wall Time— — Name —
15.8125 ( 96.0%) 0.2969 ( 95.0%) 16.1094 ( 96.0%) 16.1059 ( 96.1%) Code Generation Time
0.6563 ( 4.0%) 0.0156 ( 5.0%) 0.6719 ( 4.0%) 0.6609 ( 3.9%) LLVM IR Generation Time
16.4688 (100.0%) 0.3125 (100.0%) 16.7813 (100.0%) 16.7668 (100.0%) Total

New PM:

—User Time— --System Time-- --User+System-- —Wall Time— — Name —
67.0781 ( 99.1%) 0.3438 ( 95.7%) 67.4219 ( 99.0%) 67.4543 ( 99.0%) Code Generation Time
0.6406 ( 0.9%) 0.0156 ( 4.3%) 0.6563 ( 1.0%) 0.6644 ( 1.0%) LLVM IR Generation Time
67.7188 (100.0%) 0.3594 (100.0%) 68.0781 (100.0%) 68.1187 (100.0%) Total

It’s not an example I can easily share (and also not one I’m easily able to verify in its present state on a branch without our local changes to make sure they’re not responsible somehow), but if it’s of interest I can see whether I can reduce it to something shareable and raise a bug.

If I get a chance, I’ll try putting this through some PS4 game codebases to see how build time and run time performance compare there, but I’m one of the few manning the fort while most of the rest of the team is out having a nice time at the Developers’ Meeting this week, so I may not get that chance for a bit.

Cheers,
-Greg

I looked at a few more of my earlier results in more detail. In particular, from one of our test suites that produces some interesting compile-time and memory usage data for clang on Windows hosts.

I did a comparison between “-O2 -g” and “-O2 -g -fexperimental-new-pass-manager”. As you’ll see in the attached graphs, in general the suite contains some relatively small compilation units, mostly taking <5 seconds each to compile although there are a few larger ones in there too. I’m using 64-bit clang.exe binary targeting the PS4 triple. I’ve limited the test harness to run a single process at a time.
The first graph shows all of data points. I’m plotting wall time along the X axis and PeakWorkingSetSize (see https://msdn.microsoft.com/en-us/library/windows/desktop/ms684877%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396 ) along the Y axis. As you can see there are a few interesting outliers with the new pass manager on both axes. I’d like to do more investigation here and possibly raise some bugs if I can make shareable reproducers.

extents.png

The next graph is the same data, but I’ve zoomed right in on the bottom left corner to see that cluster with the greatest concentration of points in more detail:
zoomed.png

The clusters of data points generally map quite well to each other here. There does seem to be a visible trend that the New PM is using more memory. There’s not enough data here to extrapolate any further (remember we’re talking about <2 second compiles here), but it leaves me interested as to what the trend would be further out; in particular when applied to large LTO workloads where paging can become a big issue. Essentially, is this a small, relatively constant overhead, or does it scale with overall memory usage? I guess it’s quite possible I’m seeing the effects of different inlining strategies. This isn’t necessarily a blocker either way, but it would be an interesting question to know the answer to.

-Greg

FWIW, I've been running with the new PM on on several PS4 codebases
(the one I have available) for the last year or such, and I didn't see
any major compile time/runtime performance increase (or decrease,
FWIW). The set of games I have access to may be slightly different
than yours, so I suppose it's still a good thing to give it a shot,
but I wanted to throw yet another datapoint on the table.

I discussed this yesterday with Chandler but I'm going to echo my thoughts here.
I think we might consider doing at least one round of fuzzing before
switching the flag (in particular, as loop unswitch has been rewritten
entirely). Zhendong has already a lot of infrastructure for this, and
it seems it's just matter of changing a flag.

For the `opt-bisect` I think we could just reuse Danny's debug counter
infrastructure (and/or port the old functionality straight to the new
PM). I don't feel particularly strong about this.

Thanks,

I just tried putting a relatively recent version of our merge branch (r316139 + our local changes) through a small subset of the PS4 testing with -fexperimental-new-pass-manager enabled. In general, most tests passed :). There are a few failures I’d need to look at in more detail though. The most glaring are all our tests related to generating coverage info:

$ cat 1.cpp
void foo(){}
$ ./build/bin/clang -c -fprofile-instr-generate -fcoverage-mapping 1.cpp
$ ./build/bin/clang -c -fprofile-instr-generate -fcoverage-mapping 1.cpp -fexperimental-new-pass-manager
instrprof failed to lower an increment
UNREACHABLE executed at /home/greg/public_git/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:5893!

which I’m assuming is the same thing as was reported in https://bugs.llvm.org/show_bug.cgi?id=33773 . This would definitely be a prerequisite for us to consider flipping the default.

Yeah, this is essentially the same issue as the sanitizers and one of the remaining things I’m poking forward.

There are a few tests looking for specific codegen that are seeing something different. At a glance, it looks like the output from clang’s -O2 looks roughly similar in some small examples with and without the new PM, but clang’s -O1 is producing very different output. Whether this is an issue, I can’t say for sure (I’ve not tried benchmarking the -O1s against each other yet). Is this expected?

Yeah, I’ve not really closely tracked the pipeline differences between the old PM and the new PM at O1. My best guess is that the old PM uses the always inliner and the new PM uses a full inliner.

If the compile time of O1 is regressing a lot, its easy to try and converge these.

It’s not an example I can easily share (and also not one I’m easily able to verify in its present state on a branch without our local changes to make sure they’re not responsible somehow), but if it’s of interest I can see whether I can reduce it to something shareable and raise a bug.

FWIW, there are some things that can increase compile time. The thing we’ve seen most often is when the inliner suddenly becomes more powerful and it generates about 2x the amount of total code. Then you see the time in the code generator go up by 2x. That seems to match your numbers, but definitely feel free to file a bug if this is problematic with some more information.

I looked at a few more of my earlier results in more detail. In particular, from one of our test suites that produces some interesting compile-time and memory usage data for clang on Windows hosts.

I did a comparison between “-O2 -g” and “-O2 -g -fexperimental-new-pass-manager”. As you’ll see in the attached graphs, in general the suite contains some relatively small compilation units, mostly taking <5 seconds each to compile although there are a few larger ones in there too. I’m using 64-bit clang.exe binary targeting the PS4 triple. I’ve limited the test harness to run a single process at a time.
The first graph shows all of data points. I’m plotting wall time along the X axis and PeakWorkingSetSize (see https://msdn.microsoft.com/en-us/library/windows/desktop/ms684877%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396 ) along the Y axis. As you can see there are a few interesting outliers with the new pass manager on both axes. I’d like to do more investigation here and possibly raise some bugs if I can make shareable reproducers.

FWIW, the compile time regressions are a bit more interesting to me unless the memory regression is unusually problematic for you or someone else. See below…

extents.png

The next graph is the same data, but I’ve zoomed right in on the bottom left corner to see that cluster with the greatest concentration of points in more detail:
zoomed.png

The clusters of data points generally map quite well to each other here. There does seem to be a visible trend that the New PM is using more memory. There’s not enough data here to extrapolate any further (remember we’re talking about <2 second compiles here), but it leaves me interested as to what the trend would be further out; in particular when applied to large LTO workloads where paging can become a big issue. Essentially, is this a small, relatively constant overhead, or does it scale with overall memory usage? I guess it’s quite possible I’m seeing the effects of different inlining strategies. This isn’t necessarily a blocker either way, but it would be an interesting question to know the answer to.

This is definitely something I’m expecting, but something I want us to watch closely.

The new PM caches analyses to avoid recomputing them. The good part of this is reduced recomputation of analyses. The bad part is using more memory. So the increase in memory is expected.

However, you might expect to see more reduction in compile time. Sadly, these reductions are a bit slow for a few bad reasons:

  1. The code generator isn’t using the new pass manager yet. =[ So it throws away cached results and recomputes them. This isn’t any different from the old pass manager, but still frustrating because we spent peak memory on those results and didn’t benefit from it.

  2. A lot of the compile time goes to Clang and the code generator, hiding any improvements.

So, provided the memory increases are “OK” (IE, not causing significant resource starvation, or superlinear memory growth) I’m not too scared by them. I’m more interested in checking why there are compile time regressions.

Hi Chandler,

Here at QuarksLab we're also doing an update after each release so I
cannot test without significant effort, still I'm interested in the
answer to the following question:

In our compiler, we're sometime spawning a (legacy) passmanager in a pass (so
within an exisiting passmanager) when we want some dynamic pass
scheduling based on some analysis or user input.

Legacy passmanager forces the order between function pass and module
pass, they can't be interleaved properly, so we actually had to spawn
several PM instances one after the other to get a fine grain control over
the pass ordering.

How does the new pass manager interacts with these two use cases?

Hi Serge,

Hi,

Hi Chandler,

I ran the LNT benchmarks and SPEC2k6.train on AArch64 Cortex-A57. I used revisions: Clang 316561, LLVM 316563.

Options: -O3 -mcpu=cortex-a57 -fomit-frame-pointer -fexperimental-new-pass-manager

Regressions: execution time increase

LNT

MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow 1018.58%

MultiSource/Benchmarks/Fhourstones/fhourstones 9.06%

MultiSource/Benchmarks/Ptrdist/yacr2/yacr2 7.23%

MultiSource/Benchmarks/Olden/perimeter/perimeter 6.87%

MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset 6.02%

MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 5.59%

MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk 5.03%

SPEC2k6

453.povray 17.11%

482.sphinx3 3.44%

444.namd 2.89%

Improvements: execution time decrease

LNT

MultiSource/Benchmarks/BitBench/uudecode/uudecode -50.90%

SingleSource/Benchmarks/Adobe-C++/loop_unroll -27.75%

SingleSource/Benchmarks/Misc/perlin -21.35%

MultiSource/Benchmarks/Olden/em3d/em3d -19.12%

MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4 -8.58%

SingleSource/Benchmarks/McGill/chomp -6.33%

MultiSource/Benchmarks/sim/sim -5.41%

MultiSource/Applications/ClamAV/clamscan -3.11%

MultiSource/Benchmarks/TSVC/Symbolics-dbl/Symbolics-dbl -2.81%

SPEC2k6

429.mcf -5.18%

473.astar -2.65%

400.perlbench -1.90%

There are also code sizes increases/decreases. The maximum increase is 18.98%. The maximum decrease is 25.65%.

Thanks,

Evgeny Astigeevich

How real is this? -Hal

Greetings everyone!

The new pass manager is getting extremely close to the point where I'm not
aware of any significant outstanding work needed, and I'd like to see what
else would be needed to enable it by default. Here are the current
functionality I'm aware of outstanding:

1) Does not do non-trivial loop unswitching. Majority of this is in
https://reviews.llvm.org/D34200 but will need one or two small follow-ups.

2) Currently, sanitizers don't work correctly with it. Thanks to the work
of others, the missing infrastructure has been added and I'll send a patch
to wire this up this week.

3) Missing support for 'optnone'. I've been working on this, but the
existing testing wasn't as thorough as I wanted, so it is going slowly.
I've got about 1/4 of this implemented and should have patches this week or
next.

4) Missing opt-bisect (or similar) facility. This looks pretty trivial to
add, but I've not even started. If anyone is interested in it, go for it.
We might even be able to do something simpler using the generic debug
counters and get equivalent functionality.

... that's it?

Missing support of 'print-after-all' or 'print-after-passX' is another
major usability loss.

Regarding the default switch, maybe do it in two steps:

1) Switch the default for PGO build first
2) Switch the default for all modes.

David

Why? -Hal

Greetings everyone!

The new pass manager is getting extremely close to the point where I'm
not aware of any significant outstanding work needed, and I'd like to see
what else would be needed to enable it by default. Here are the current
functionality I'm aware of outstanding:

1) Does not do non-trivial loop unswitching. Majority of this is in
https://reviews.llvm.org/D34200 but will need one or two small
follow-ups.

2) Currently, sanitizers don't work correctly with it. Thanks to the work
of others, the missing infrastructure has been added and I'll send a patch
to wire this up this week.

3) Missing support for 'optnone'. I've been working on this, but the
existing testing wasn't as thorough as I wanted, so it is going slowly.
I've got about 1/4 of this implemented and should have patches this week or
next.

4) Missing opt-bisect (or similar) facility. This looks pretty trivial to
add, but I've not even started. If anyone is interested in it, go for it.
We might even be able to do something simpler using the generic debug
counters and get equivalent functionality.

... that's it?

Missing support of 'print-after-all' or 'print-after-passX' is another
major usability loss.

Regarding the default switch, maybe do it in two steps:

1) Switch the default for PGO build first
2) Switch the default for all modes.

Why?

1) PGO users benefit from the PM change the most
2) I suppose incremental changes are always better and causes fewer churns?

David