I quickly checked the execution profile. The regression is real: the generated code changed significantly, and a number of the hottest regions changed. I’ll compare the IRs.
JFYI FreeBench/fourinarow time graph: http://lnt.llvm.org/db_default/v4/nts/graph?highlight_run=76922&plot.1604615=1349.1604615.3
Its graph in our LNT is more stable.
Thanks. Obviously a 1000% execution performance regression seems problematic. -Hal
I’m somewhat nervous about having the PGO builds differ so much from the non-PGO builds. My thought is that: (Significant) regressions in execution performance or compile time need to be fixed regardless. Maybe there’s some flexibility on code-size regressions (e.g., such that we could delay the transition for -Oz if necessary)? -Hal
The new PM always runs -latesimplifycfg rather than -simplifycfg. Test to show it:
Given that these patches cited SPEC and test-suite perf while changing the behavior of simplifycfg:
…the differences mentioned here might be related?
Kindly request review so we can close this hole.
The new pass manager is getting extremely close to the point where I'm
not aware of any significant outstanding work needed, and I'd like to see
what else would be needed to enable it by default. Here is the outstanding
functionality I'm currently aware of:
1) Does not do non-trivial loop unswitching. The majority of this is in
https://reviews.llvm.org/D34200, but it will need one or two small
follow-up patches.
2) Currently, sanitizers don't work correctly with it. Thanks to the
work of others, the missing infrastructure has been added and I'll send a
patch to wire this up this week.
3) Missing support for 'optnone'. I've been working on this, but the
existing testing wasn't as thorough as I wanted, so it is going slowly.
I've got about 1/4 of this implemented and should have patches this week or
next.
4) Missing opt-bisect (or similar) facility. This looks pretty trivial
to add, but I've not even started. If anyone is interested in it, go for
it. We might even be able to do something simpler using the generic debug
counters and get equivalent functionality.
... that's it?
Missing support for 'print-after-all' and 'print-after-passX' is another
major usability loss.
Regarding the default switch, maybe do it in two steps:
1) Switch the default for PGO build first
2) Switch the default for all modes.
The reasons: 1) PGO users benefit from the PM change the most.
2) I suppose incremental changes are always better and cause less churn?
> I'm somewhat nervous about having the PGO builds differ so much from the
> non-PGO builds. My thought is that: (Significant) regressions in execution
> performance or compile time need to be fixed regardless. Maybe there's some
> flexibility on code-size regressions (e.g., such that we could delay the
> transition for -Oz if necessary)?
Is that another reason (-Oz) we want to do the transition selectively
first? For PGO, size and compile-time regressions are usually not the
main concerns, so it is basically the candidate with no roadblocks for
the switch.
Thank you for the information. I’ll check whether the regressions are related to this.
I’ve got results from our private benchmarks. The Clang options were the same. There are a few regressions; the biggest is 8.24%. We’ve got a lot of improvements in both scores and code size: the biggest score improvement is 22.07%, and the biggest code-size improvement is 16.10%. The benchmark binaries are quite big: the improved ones are larger than 1 MB.
I’ll check the impact on AArch32 and investigate the 1000% regression.
I ran compile-time testing on CTMark; the numbers are below:
==== O0-g ====
==== Os ====
==== O3 ====
So, in general, it's a win, but there are a couple of big outliers: sqlite3 on Os and kimwitu++/kc on O3. It would be good to understand if we can avoid these slowdowns.
I just want to send a quick thanks to the numerous folks sending benchmark numbers my way; this is tremendously useful. I’m going to go through and triage these, but if any of you are feeling motivated and want to file a bug to track a specific one and assign it to me, that also works. The big thing that would help is confirming the exact architecture used. I think I know based on who posted them, but confirmation is useful. =D
Largely, I agree with Hal regarding strategy: the kinds of regressions we’re seeing here really just need to be either fixed or deeply understood and explained as not something we care about. And I think that is true for all of execution time, code size, and compile time. So far, looking at the numbers, I’m not really nervous about any of this. Once that’s done, I don’t think we need to do any kind of elaborate incremental switch.
I’m on holiday for a week so there will be some delay in me at least starting to work on fixing these, but of course, if others want to jump on any of them, that’s always welcome.
We’ve just started testing the new pass manager this week and we ran into a 548x slowdown (i.e., 6.28s to 3443.83s) for one of the files from SPEC2017/blender. The issue arises only in debug builds due to the numerous calls to RefSCC::verify() and SCC::verify() in the LazyCallGraph implementation. Would it make sense to start predicating these calls with the EXPENSIVE_CHECKS macro, rather than NDEBUG?
Sorry, by debug build I actually meant asserts enabled. Thus, this issue can show up in either a debug or release build, if asserts are enabled.
I thought most of the extremely superlinear asserts were already behind EXPENSIVE_CHECKS, but we can add this one if you have a test case. Could you file a bug w/ a test case? I’d also be happy to try and just make the verify a bit less egregious. But I don’t have that version of SPEC…
Sure, I’ll file a bug and investigate a bit in the morning. I’ll also see if I can’t find a similar regression with one of the llvm test suite tests, so it will be easier to reproduce on your end.
Let me know if I can be of further assistance.
I compared the IRs and can see what the issue is. I have not yet found which pass is responsible for it.
The benchmark source: http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.c?view=markup
The function of interest is ‘value’. There is a loop nest of small loops containing an if-statement. I see that the internal loops are fully unrolled. With the new pass manager, most of the cmp+br instruction pairs are replaced with select instructions. This results in huge basic blocks (400+ instructions), so we create a lot of instructions that are now executed unconditionally. According to the execution profile of the “good” program, a lot of the branches are not taken. The “good” program executes 1.01 billion instructions; the “bad” program executes 1.62 billion. Another problem is that a lot of the created instructions are FP instructions: SCVTF/FADD/FCVTZS. According to the Cortex-A57 optimization guide, they are quite heavy:
SCVTF: latency 10, throughput 1
FADD: latency 5, throughput 2
FCVTZS: latency 10, throughput 1
I think now it is clear why the regression is so huge.
As soon as I identify the pass responsible, I’ll create a bug report. Feel free to create it yourself if you know the pass; just give me the ticket number and I’ll add the results of my analysis.