[RFC][GlobalISel] Replace the current GlobalISel matcher with a bottom-up matcher

Hi!

The current GICombiner can match an instruction pattern sequence, but this feature is basically not used in combiner rules. One reason is that the matching capability is somewhat limited. Let’s look at some examples.

I can easily match a G_ICMP/G_ZEXT/GSUB sequence:

  (match (G_ICMP $icmp, $cc, $srcb, $srcc),
         (G_ZEXT $zext, $icmp),
         (G_SUB $dst, $srca, $zext))

Just change G_SUB to G_ADD:

  (match (G_ICMP $icmp, $cc, $srcb, $srcc),
         (G_ZEXT $zext, $icmp),
         (G_ADD $dst, $srca, $zext))

and it does not work as expected, because the current implementation does not handle commutable instructions like the C++ mi_match(). In order to get the desired outcome, you have to add another combine rule, just with the $srca and $zext operands swapped. Obviously, just using C++ code seems easier.

Even more annoying, turning to a simple tree pattern like

  (match (G_AND $srca, $t1, $t2),
         (G_AND $srcb, $t3, $t4),
         (G_OR $dst, $srca, $srcb)) 

just crashes TableGen.

I think there is value in making the pattern matching more useful. To solve both problems, I implemented a bottom-up matcher in D141135. It is a drop-in replacement for the current matcher, and passes all LIT tests without regressions.

The basic idea of a bottom-up matcher is to associate each instruction the set of matching patterns, called the match set. For an instruction without use operands (a leaf) the match set is easily determined. For any other instruction, the match set is retrieved via a table lookup, using the use operands as table indices. As a result, all matching patterns can be found in linear time with one tree traversal.

Commutable instructions are handled by adding the variant patterns resulting from swapping the subpatterns. For the resulting code, the implication is that only the variable binding varies. The rule code is not duplicated.

This implementation is based in the algorithm from Chase as described in his paper An improvement to bottom-up tree pattern matching.

Some numbers from Linux on Z (orig vs change applied):
Compiling LLVM is slightly longer with this change applied: 25m35.471s vs 28m51.797s
The resulting llc binary is slightly smaller: 145.899.472 bytes vs 145.849.704 bytes
I’ve got similar results in Linux on Power.
I have not yet measured the impact of the resulting matcher on the compile time.

One interesting point to note:
The current combiner implementation loops over the basic blocks of a machine functions until a fixpoint is reached. The bottom up matcher matches all patterns in one round, so in theory a single loop of the instructions should be enough. However, there are 2 reasons why this is not implemented:

  • The last loop over the instructions removes trivially dead instructions, which would not happen if only a single loop over the instructions is done. This has no functional impact. The next combiner, or latest the instruction selector removes trivially dead instructions. However, lot of test case would need to be changed because they do not expect those dead instructions. A simple solution would be to make a second loop over the instructions, just removing the trivially dead instructions, which would be faster than doing a full iteration.
  • More important, not all patterns would be matched! The reason is that there are C++ matchers which perform a top-down match. One example is the rule fold_global_offset from the AArch64PreLegalizerCombiner. This rule matches a G_GLOBAL_VALUE only, but the C++ matcher looks at the uses of the defined value, which basically means that the matcher reverses the order in which matches are made. As a result, the bottom-up matcher is not aware that it has to apply another rule. I think this case can be fixed by changing the matching pattern. The question is if this is desirable, because it would add restrictions to what a C++ matcher is allowed to do.

Please let me know what you think about this approach.

Kind regards,
Kai

1 Like

I think this idea definitely sounds interesting. And as you also mentioned, I would love to see the impact on compilation time as well as the performance / code size of the compiled binary. I think llvm-test-suite would be a good benchmark for this purpose.

Thanks!

I currently setup a AArch64 instance to get some more compile time numbers. Besides the LLVM test suite I plan to bootstrap clang with this change.

Could the same technique be used to create a PowerPC or AARCH64 instruction matcher for a PostIsel combiner?

Yes, as long as the instructions are still in SSA form.

With my latest update to the patch, I get the following compile time for the LLVM test suite with -fglobal-isel on an AArch64 EC2 instance (2 CPUs, 4GB):
Current implementation: 22m43.224s
Bottom-up implementation: 22m47.165s

Size of the binaries:
Current implementation: llc 127.280.024, clang 217.540.416
Bottom-up implementation: llc 127.026.960, clang 217.291.456