I’ve been profiling the combiners recently. I’m using Linux’s perf record through a GUI tool called HotSpot, which lets me zoom in on specific parts of the build process and filter for events that happened only during a given function call. With that, I can filter on something like tryCombineAll for the PreLegalizerCombiner and see where the time is spent.
It’s not an exact science (I’ve seen a few percent of variance in some functions’ timings between runs), but it tells me where we waste the most time, which is still useful.
For instance, if we zoom in on the AArch64PreLegalizerCombiner, we see that:

- executeMatchTable takes 75% of the combiner’s execution time (including all callees), but only 10% of the time is spent in the function itself (self time)! So the MatchTable itself is already very fast.
- getKnownBits is as expensive as the MatchTable, at about 10% of the time. 7.47% of that is in getKnownBitsImpl, and DenseMap construction accounts for 1.83%. I think this is due to KnownBits being an expensive object.
- matchICmpToTrueFalseKnownBits is also very expensive, at about 8% of the time, due to calls to getKnownBits on all G_ICMP occurrences.
- The MatchInfoTy struct is very expensive to construct on every iteration (every instruction visited): operator= seems to take about 7.47% and the constructor 4.37%! I’m going to look at optimizing that very soon.
My current idea is to create a “lazy” allocator that only allocates a field when it’s requested, so we don’t pay for what we don’t use.
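To make the “lazy” idea concrete, here’s a minimal sketch of the pattern I have in mind: a field wrapper that defers construction until first access, so iterations that never touch the field pay only for an empty `std::optional`. All names here (`Lazy`, `Expensive`, `MatchInfo`) are hypothetical stand-ins for illustration, not the actual MatchInfoTy code.

```cpp
#include <optional>
#include <vector>

// Hypothetical stand-in for a costly member of a MatchInfoTy-style struct.
struct Expensive {
  std::vector<int> data;
  Expensive() : data(1024, 0) {} // expensive default construction
};

// Lazy wrapper: the wrapped value is only constructed on first access.
template <typename T> class Lazy {
  std::optional<T> value;

public:
  T &get() {
    if (!value)
      value.emplace(); // construct on first use only
    return *value;
  }
  bool constructed() const { return value.has_value(); }
};

// Hypothetical per-iteration match-info struct: the expensive payload is
// only built if a combine actually requests it.
struct MatchInfo {
  Lazy<Expensive> payload;
};
```

The tradeoff is an extra branch and flag check on each access, but that should be far cheaper than unconditionally constructing every field for every instruction visited.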
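On the getKnownBits side, the generic mitigation for repeated expensive queries is memoization: compute the known bits for a register once and serve later queries from a cache. The sketch below shows that caching shape in miniature; none of these names (`MiniKnownBits`, `KnownBitsCache`, the per-register-id key, the fake even/odd computation) are LLVM’s actual API, and it says nothing about whether the real analysis already caches, only what the pattern looks like.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy known-bits result: which bits are known zero / known one.
struct MiniKnownBits {
  uint64_t Zero = 0;
  uint64_t One = 0;
};

class KnownBitsCache {
  std::unordered_map<unsigned, MiniKnownBits> Cache; // keyed by register id
  unsigned Computations = 0;

  MiniKnownBits computeImpl(unsigned Reg) {
    ++Computations; // track how often we do the "expensive" work
    // Stand-in for a real recursive computation: pretend even register
    // ids are known to have their low bit clear.
    MiniKnownBits KB;
    if (Reg % 2 == 0)
      KB.Zero = 1;
    return KB;
  }

public:
  const MiniKnownBits &getKnownBits(unsigned Reg) {
    auto It = Cache.find(Reg);
    if (It != Cache.end())
      return It->second; // cache hit: no recomputation
    return Cache.emplace(Reg, computeImpl(Reg)).first->second;
  }
  unsigned computations() const { return Computations; }
};
```

With this, asking for the same register twice performs the computation only once, which is the kind of saving that matters when something like matchICmpToTrueFalseKnownBits queries known bits for every G_ICMP it sees.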