I’ve been profiling the combiners recently. I’m using Linux’s perf record through a GUI tool called HotSpot, which lets me zoom in on specific parts of the build process and filter for events that happened only during a given function call. With that, I can filter on something like tryCombineAll for the PreLegalizerCombiner and see where the time is spent.
It’s not an exact science (I’ve seen a few percent of variance in some functions’ timings between runs), but it tells me where we waste the most time, which is still useful.
For instance, if we zoom in on the AArch64PreLegalizerCombiner, we see that:

- executeMatchTable takes 75% of the combiner’s execution time (including all callees), but only 10% of the time is spent in the function itself (self time)! So the MatchTable itself is already very fast.
- getKnownBits is as expensive as the MatchTable, at about 10% of the time. 7.47% of that is in getKnownBitsImpl, and DenseMap construction accounts for 1.83%. I think this is due to KnownBits being an expensive object.
- matchICmpToTrueFalseKnownBits is also very expensive, at about 8% of the time, due to calls to getKnownBits on all G_ICMP occurrences.
- The MatchInfoTy struct is very expensive to construct on every iteration (every instruction visited): operator= seems to take about 7.47% and the constructor 4.37%! I’m going to look at optimizing that very soon.
My current idea is to create a “lazy” allocator that only allocates a field when it’s requested, so we don’t pay for what we don’t use.
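To make the “lazy” idea concrete, here’s a minimal sketch of the pattern I have in mind: a field wrapper that defers construction until first access, so iterations that never touch the field pay only for an empty `std::optional`. All names here (`Lazy`, `Expensive`, `MatchInfo`) are hypothetical stand-ins for illustration, not the actual MatchInfoTy code.

```cpp
#include <optional>
#include <vector>

// Hypothetical stand-in for a costly member of a MatchInfoTy-style struct.
struct Expensive {
  std::vector<int> data;
  Expensive() : data(1024, 0) {} // expensive default construction
};

// Lazy wrapper: the wrapped value is only constructed on first access.
template <typename T> class Lazy {
  std::optional<T> value;

public:
  T &get() {
    if (!value)
      value.emplace(); // construct on first use only
    return *value;
  }
  bool constructed() const { return value.has_value(); }
};

// Hypothetical per-iteration match-info struct: the expensive payload is
// only built if a combine actually requests it.
struct MatchInfo {
  Lazy<Expensive> payload;
};
```

The tradeoff is an extra branch and flag check on each access, but that should be far cheaper than unconditionally constructing every field for every instruction visited.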
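On the getKnownBits side, the generic mitigation for repeated expensive queries is memoization: compute the known bits for a register once and serve later queries from a cache. The sketch below shows that caching shape in miniature; none of these names (`MiniKnownBits`, `KnownBitsCache`, the per-register-id key, the fake even/odd computation) are LLVM’s actual API, and it says nothing about whether the real analysis already caches, only what the pattern looks like.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy known-bits result: which bits are known zero / known one.
struct MiniKnownBits {
  uint64_t Zero = 0;
  uint64_t One = 0;
};

class KnownBitsCache {
  std::unordered_map<unsigned, MiniKnownBits> Cache; // keyed by register id
  unsigned Computations = 0;

  MiniKnownBits computeImpl(unsigned Reg) {
    ++Computations; // track how often we do the "expensive" work
    // Stand-in for a real recursive computation: pretend even register
    // ids are known to have their low bit clear.
    MiniKnownBits KB;
    if (Reg % 2 == 0)
      KB.Zero = 1;
    return KB;
  }

public:
  const MiniKnownBits &getKnownBits(unsigned Reg) {
    auto It = Cache.find(Reg);
    if (It != Cache.end())
      return It->second; // cache hit: no recomputation
    return Cache.emplace(Reg, computeImpl(Reg)).first->second;
  }
  unsigned computations() const { return Computations; }
};
```

With this, asking for the same register twice performs the computation only once, which is the kind of saving that matters when something like matchICmpToTrueFalseKnownBits queries known bits for every G_ICMP it sees.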