I was wondering what the next steps are in deciding whether the pass should be enabled by default yet. What do we need to address first?
I have used SPEC2017 intrate and the llvm-test-suite to evaluate the performance impact on AArch64 at -O3. The score for SPEC 505.mcf improves by about 8.25% when LTO is enabled. The geomean of the execution time for the llvm-test-suite decreases by 1.26%. I also measured the impact on compilation times using instruction count as the metric, similarly to https://llvm-compile-time-tracker.com. The geomean regresses by about 0.35%. Here is the breakdown:
There seems to be a discussion about supporting ThinLTO here: [ThinLTO] Import functions to enable function specialization in ThinLTO. My understanding after reading the thread is that the original proposal wouldn’t scale due to unintended importing of indirect specializations, so the relevant patches were abandoned (D105966, D105524). Is there a new idea currently being worked on, @ChuanqiXu? The thread also implies that Function Specialization could potentially replace IPSCCP. Is this perhaps a route we should pursue in order to keep compilation times lower?
Here are our recent efforts to improve the Function Specialization pass:
I think there are multiple questions in the post. Let me try to reply to them one by one.
Which pipelines do we want to enable it in?
I think there are two options: the LTO pipeline only, or both the O3 and LTO pipelines (enabling it only for O3 doesn’t make sense). From your words, it looks like you want to enable it for both O3 and LTO, is that right? If so, we need to test it with both O3 and LTO enabled and share the corresponding results.
Should we enable Function Specialization?
I believe people wouldn’t complain if:
We could show that the implementation wouldn’t miscompile.
We could show that it wouldn’t cause a performance regression generally.
We could show that it wouldn’t cause compile time, code size, or memory usage to increase significantly.
Clearly, we can’t prove these; we can only test them as thoroughly as possible. The current test set, {SPEC2017, llvm-test-suite, projects-from-llvm-compile-time-tracker}, looks good. I would feel better if we could add LLVM itself and Chrome, since both are large and often appear in research as compile-time benchmarks.
Then, for the correctness requirement, I would suggest specializing as many functions as possible. That means testing in LTO mode, where more functions can get specialized, and setting the parameters as aggressively as we can. For example, we could set FuncSpecializationMaxIters, MaxClonesThreshold and AvgLoopIterationCount larger, set SmallFunctionThreshold smaller, and enable SpecializeOnAddresses. If all the tests pass, then the more functions we specialize, the more confidence we have that the pass is correct.
For the performance requirement, we should also test both O3 and LTO, and we should have a list showing how many projects become faster, how many become slower (if any), and the corresponding change ratios. You said the geomean of the execution time decreased; my first question is whether some individual tests got worse. I don’t mean we couldn’t enable it if there is any regression, just that more data is better.
For the last requirement, we should also test both O3 and LTO. The key question is: what threshold counts as significant? I remember @nikic commented once when a patch of mine increased memory usage by 0.6%, so I wonder if the threshold might be 0.5%. Perhaps @aeubanks knows.
Supporting FuncSpec in ThinLTO
Yeah, as the thread you posted shows, there are some issues with my original proposal, so the corresponding patches were abandoned. And I don’t have new ideas…
Could Function Specialization replace IPSCCP
Personally, I feel it is natural that Function Specialization is a superset of IPSCCP, so if Function Specialization is good enough, I think it would be OK to replace IPSCCP in some pipelines. The question is how to measure whether Function Specialization is good enough. I am not sure about that, but I am fairly sure the current state is not good enough… That is another topic though; let’s discuss it in other threads.
Thanks for your valuable input. I have been running correctness tests with aggressive settings for the Function Specialization pass. So far I have only tried multistage builds of clang and lld: stage 1 with AddressSanitizer, stage 2 with LTO, then a plain stage 3. Building clang with LTO is extremely slow, by the way, which is why I opted for a stage 3 build instead of check-all at stage 2. I came across this crash at stage 3, but I haven’t been able to reproduce it since.
clang++: /data/oss-llvm/llvm-project/clang/include/clang/AST/DeclCXX.h:867: void clang::CXXRecordDecl::setImplicitDestructorIsDeleted(): Assertion `(data().DefaultedDestructorIsDeleted || needsOverloadResolutionForDestructor()) && "destructor should not be deleted"' failed.
I am puzzled that I found it at stage 3 and not stage 2, since function specialization was only enabled at stage 1, so I am wondering whether it’s related or not.
I think we could find out whether it is related by running the experiment again without Function Specialization. BTW, it seems a little strange to me that the settings for stages 1-3 differ; as I understand bootstrapping, the settings for stages 1-3 should be the same.
Could you also share stats about how often the optimization triggers for those programs? Some compile-time increase can be fine, if the optimization is somewhat general.
Also IIUC this is only for -O3, right? What’s the impact with LTO? With -Os?
IIUC function specialization is just an extension of IPSCCP. It would probably make sense to merge them, but I’d do it the other way round, as function specialization mostly just makes IPSCCP more effective?
It would be good to get an idea of what the compile-time gain would be. If it improves compile time, that would be good to do before enabling.
I’ve also measured the execution times of the llvm-test-suite with LTO running on a single core (pinned with taskset). The geomean improves by 1.51973% on a Xeon Gold 5120T, and by 3.7059% on Neoverse N1. Unfortunately I could not collect statistics with LTO (the build failed with ld.lld: error: -plugin-opt=: unknown plugin option ‘stats-file=…’).
The compilation times didn’t look so good with LTO though. Here’s the instruction count increase recorded by perf on Xeon:
The first part mentions an improvement with LTO, but the second part says you failed to collect statistics with LTO… Am I misunderstanding something?
The first part talks about compilation time, and the second part about instruction counts from perf. Do you mean you measured compilation time with perf? That feels inaccurate and unnecessarily complex. Could we simply measure elapsed time?
As Florian suggested, it would be good to record how many times function specialization triggers.
The first part mentions an improvement with LTO, but the second part says you failed to collect statistics with LTO… Am I misunderstanding something?
Statistics, as in LLVM statistics (number of specializations etc.). That’s because the driver misinterpreted the ‘stats-file’ option for some reason.
The first part talks about compilation time, and the second part about instruction counts from perf. Do you mean you measured compilation time with perf? That feels inaccurate and unnecessarily complex. Could we simply measure elapsed time?
Elapsed time can be very noisy, so the standardized method to measure the effect of a change on compilation times is to compare clang’s instruction count while compiling the llvm-test-suite. This information is available when instrumenting with perf. See https://llvm-compile-time-tracker.com
Oh, I would have thought instruction counts are the noisier signal, since different instructions take different amounts of time to execute. But it is probably fine here, since we’re measuring changes in the middle end and a clean timing environment is hard to find.
Hi. I have been making an effort to merge Function Specialization with IPSCCP in order to reduce compilation times. The results are somewhat better than running the two passes separately, especially for non-LTO runs. I have attached a spreadsheet measuring instruction counts when compiling the CTMark subdirectory of the llvm-test-suite, as before. The merge is non-trivial, as many bugs come up in the process, so I would appreciate your opinion, @ChuanqiXu, @fhahn, on whether I should pursue this further based on these numbers.
I am worried you may have misunderstood me… I didn’t ask you to merge Function Specialization and IPSCCP in this run; I just felt it might be a future direction. Sorry if I wasn’t clear enough.
Back to the topic of Function Specialization and IPSCCP. I think FuncSpec is a superset of IPSCCP, so merging them should be possible. In fact, I removed the constraint against specializing on a single constant in FuncSpec internally and it ran well (I did not remove IPSCCP). But I am afraid this might not be a good time to merge them, since FuncSpec is not mature enough yet. My personal preference would be: enable FuncSpec by default → fix bugs/enhance functionality → finally remove IPSCCP.
Finally, the PDF you uploaded is broken and I couldn’t view it…
I think merging the passes is an idea worth exploring, and if we believe FuncSpec is still premature it can remain disabled until proven otherwise. My patch ⚙ D126455 [FuncSpec] Make the Function Specializer part of the IPSCCP pass. uses an option to control whether FuncSpec runs right after the original IPSCCP algorithm. The follow-up patch ⚙ D126456 [SCCP] Notify the Solver when an instruction is removed. exposes latent bugs that currently don’t manifest, as the SCCPSolver only runs at the beginning of the original IPSCCP pass, whereas FuncSpec runs it before and after modifying basic blocks.
Now I can see the results. From the numbers, it does indeed look better to run FuncSpec and IPSCCP together from a compile-time perspective. The idea of merging IPSCCP into FuncSpec first and trying to enable FuncSpec later looks fine to me; making something better (FuncSpec, in this case) is always a good idea.
But we need to be prepared for the fact that removing something is much harder than adding it.
I wouldn’t call this a latent bug. A fundamental assumption of the (IP)SCCP implementation is that first all solving is done and only then is the IR modified. If that needs changing, I think it would be good to first understand why that is the only option. Adding new functions, for example, shouldn’t be a problem, as the solver supports exploring blocks/functions as they become reachable.
I wouldn’t call this a latent bug. A fundamental assumption of the (IP)SCCP implementation is that first all solving is done and only then is the IR modified. If that needs changing, I think it would be good to first understand why that is the only option. Adding new functions, for example, shouldn’t be a problem, as the solver supports exploring blocks/functions as they become reachable.
I am not following; can you explain? By marking basic blocks reachable (and thus executable), the solver just adds them to the worklist, which only gets processed when we actually invoke solve(). Is it wrong that the Function Specializer invokes the SCCP Solver for the newly created functions? Should we be doing something differently?
In an effort to reduce the compilation-time overhead of running Function Specialization as a standalone pass, I have put together a series of patches merging it with SCCP. They might not be production ready, nor am I claiming we should definitely do this, but it was an idea worth exploring; I would like to share these patches with you and hopefully get some feedback.
I bumped into many crashes while compiling the llvm-test-suite with Function Specialization enabled and tuned aggressively (FuncSpecializationMaxIters > 1, MaxClonesThreshold > default, SmallFunctionThreshold < default, AvgLoopIterationCount > default, EnableSpecializationForLiteralConstant = true). I also encountered many use-after-free errors when performing a bootstrap build of clang with full LTO and AddressSanitizer-instrumented binaries. They seem to be gone in my latest revision, but the code might still not be bulletproof.
As Florian suggested earlier, the problem seems to be that the Solver is meant to run before we make changes to the IR, but Function Specialization now runs after (at the end of) the original SCCP pass. This wasn’t a problem when it ran as a standalone pass.
Could you also share similar stats for performance?
Some minor suggestions on the stats:
(1) For ctmark.pdf, it might be better to add text in the PDF saying it measures instruction counts at compile time, so that readers know it measures compile time and that the metric is instruction counts. Another suggestion is to rewrite the delta column from 0.600679176086455 to 0.6%, so readers know it is a percentage.
(2) For stats.pdf, the specialization counts are really small. I guess it doesn’t include an LTO build, right? I think that would be helpful.
(1) I’ve uploaded funcspec-llvmtestsuite-ctmark.pdf, where I’ve added the geomean and rounded the delta to three decimal digits. I also renamed the column to indicate percentage differences, and added a note saying we are comparing instruction counts.
(2) The stats are coming from an LTO build.
Regarding performance numbers, last time I checked, the score of 505.mcf from SPECv6 improved by about 8.5% on Neoverse N1 (at -O3 with LTO). The patches were otherwise neutral for all other tests in SPEC. The results from running the llvm-test-suite (with the same configuration) were inconclusive: the geomean of the execution time was noisy, with negligible deltas (sometimes positive, sometimes negative).
Thanks. The compile-time change looks fine: the highest increase is 0.3% in non-LTO mode, and even with LTO the highest increase of 3% seems acceptable, since we generally don’t care as much about compile time with LTO. So as long as it doesn’t introduce performance regressions or build breaks, it looks fine to me.
I think adding new blocks in newly created functions should just work without any problems. The issue is changing instructions or basic blocks the solver has already visited; ideally this should be avoided.