[RFC] GlobalISel support for X86

Introduction

GlobalISel is another approach for instruction selection comparing to
SelectionDAG.

The main features, as I can highlight, are less time required for
selection and selection at function level. In contrast, SelectionDAG operates inside basic
block scope and function level optimizations mostly done by CodeGenPrepare pass.

Starting migration from SelectionDAG to GlobalISel is especially painful for X86
due to highly specialized and optimized DAG combiners.

This RFC is about suggesting the first steps in this journey.

Current state

GlobalISel has some support for X86 back-end. Despite of the fact, we can see
issues closed as obsolete or won't fix and suggestions to not use GlobalISel
in production. Also there is a warning in clang if you try to use
-fglobal-isel for X86 target.

However, there are commits mostly from @RKSimon and @tschuett that
introduce new legalization rules or opcode selection for X86.

Current approach

Iā€™ve taken a several commits that introduce new legalization of opcodes. But
after legalization they cannot be selected. It looks a little bit strange to me.
Weā€™ve made a functional change in compiler, but nothing has changed in terms of
produced code. Why is it important?

During my experiments with GlobalISel coverage, I needed G_SELECT support:
hooray, legalization is already there, only selection is needed to add. But if
you try to reuse selection tables from SelectionDAG to select G_SELECT, you
fail. There is no patterns to select ISD::SELECT, so, we either need to write
selection in C++, or write a selection pattern for SelectionDAG so that ISD::SELECT
obtained from GlobalISel can match it. Also we can rewrite legalization into splitting
G_SELECT into G_ICMP with G_CMOV (target dependent opcode).

The idea here is that fixing legalization without a plan how to select
legalized opcodes may create problems.

SelectionDAG compatibility

The current state of X86InstructionSelector is to reuse SelectionDAG tables and
if failed, try to use C++ selection that should be written especially for
GlobalISel.

It is a straightforward idea, however SelectionDAG tables contain X86ISD
nodes that have no mapping at all from GlobalISel opcodes. Iā€™ve met this
problem when tried to support G_FSHL and G_FSHR
D157505 when we have specialized
X86ISD::FSHL/X86ISD::FSHR nodes that are used only for i16 type.

Target intrinsic support

Unfortunately, I havenā€™t found a target that has rich support of intrinsics. On most targets
(AArch64, M68k, PowerPC, RISCV, X86) intrinsics are completely rewritten or omitted. It is generally not
an option for x86 (as well as for other targets, I think).

Specifically for X86, intrinsics are written as a mapping from IntrinsicID to ISD or X86ISD node.
To reuse X86IntrinsicsInfo.h, we need two things:

  1. The existing mapping including not only SelectionDAG nodes but GlobalISel nodes.
  2. Add GlobalISel nodes and GINodeEquiv declaration to map GlobalISel nodes during instruction
    selection

If the first is mostly a refactoring issue, the second looks like the complete reimplementation.
However, if we admit that one way or another most of the X86ISD nodes will have an
equivalent in GlobalISel for purposes alternative to intrinsics then it is not a problem.

Alternatively, Iā€™d like to ask authors of GlobalISel whether there is a view to provide generalized
intrinsic support. Now, GlobalISel users prefer to ignore target intrinsics or reimplement them from
the scratch.

Combiners

At this moment it is hard to predict something about combiners since only AArch64, Mips and AMDGPU
implement them. If someone has any thoughts about it, please share.

From my perspective, at initial stage we only may think about how frequently we introduce target
specific GlobalISel nodes during legalization. Because here we determine whether we want to introduce
target specific combine patterns after legalization or we try to keep generic nodes until instruction
selection to allow the generic combiner working.

AArch64 experiment

To roughly estimate the existing support, Iā€™ve taken AArch64 backend (considering its GlobalISel support as one of the most advanced) and done the following:

  1. Modified IRTranslator pass to dump LLVM IR of a function when GlobalISel
    representation has a few G_* opcodes (ideally only one, but this is
    not always possible)
  2. Compiled the IRs for X86 using GlobalISel meanwhile updating the IRs from
    AArch64 specifics.
  3. Collected statistics is out of 212 (on the moment of the experiment) generic GlobalISel opcodes:
    1. Generic opcodes not appeared in AArch64 with GlobalISel from .ll to .mir: 43
    2. Triggered internal assert about PhysReg copy: 2
    3. Cannot be legalized: 110
    4. Cannot be selected: 11
    5. Successfully compiled: 47

Itā€™s clear that this experiment may be too pessimistic as we may hit vector
typed opcodes when only scalar version is supported, or some f128 types, or
multi-opcode IRs where only auxiliary one cannot be compiled and therefore it
gates the positive result. Nevertheless, this statistic shows that we can
compile barely 25% of isolated GlobalISel opcodes.

C/C++ coverage

After the experiment with isolated GlobalISel opcodes, Iā€™ve decided to use C and
C++ Validation Test Suites. Unfortunately, no tests can be compiled:

  1. G_GLOBAL_VALUE isnā€™t supported for fpic mode
    D157396
  2. Apparently, there is a place assuming that after selection all virtual
    registers have a register class. However, it is not true for GlobalISel (itā€™s not X86
    specific, AArch64 has the same behavior) D157458
  3. All tests require G_SELECT. The problem with its selection is described above.

Everything else besides opcodes

I couldnā€™t identify any other issues that may require a systematic approach in the first place, e.g. calling conventions or TableGen issues.

Proposal

The general idea is to repeat AArch64 path: to start with basic support without
any combiners or optimizations i.e. -O0 GlobalISel.

  1. Understand the current coverage:
    1. C and C++ Validation Test Suites are used as a criterion for X86 GlobalISel
      support.
    2. LLVM testsuite with enforced -global-isel for more complete testing
    3. Reuse tests from AArch64 experiment for per-opcode testing. These tests
      will be becoming useless when more opcodes supported, and proper tests are added
      into LLVM Testsuite.
  2. Follow opcode-oriented approach: when adding support for a single opcode,
    support it from IRTranslator till InstructionSelector (at once or at least
    have understanding how to do it).
  3. Prioritization of opcodes can be done using testsuites from point 1. But
    general idea is a two axes plane:
    1. Start supporting from scalar versions and end with vectorized.
    2. Start from basics (add, shl, mul) and end with independent and target
      intrinsics (llvm.returnaddress, llvm.sse.*)
  4. During implementation I propose to reach the maximum reuse of existing
    SelectionDAG tables. However, we may try to create
    GlobalISel specific patterns as an alternative.

Open questions

To sum up, there are two questions I donā€™t have answer for.

  1. How to minimize efforts for target intrinsics support?
  2. Are there concerns that should be considered now for further steps with combiners.

@RKSimon @phoebe @tschuett @nikic @arsenm

1 Like

This is not true. Target intrinsics pass through G_INTRINISC* opcodes and the DAG imported patterns understand the same int_target_xxx patterns. The X86ISD opcodes are not equivalent to intrinsics, they are equivalent to target defined pseudo instructions. You can define a mapping from target defined pseudos to target defined SDNodes such that patterns still work (e.g. G_AMDGPU_FFBH_U32)

This is just a bug. Everything after selection should have a register class

Iā€™ve always been of the opinion that trying to catch up to SelectionDAG is an insane task to take on initially.

My work on X86 GlobalISel tends to wax and wane, but the plan at that back of my head has been to see if we can match FastISel first, a lot of the emitter code is very similar already and Iā€™d dearly like to not end up with 3 Selection modes for very longā€¦

Weā€™d then delete X86FastISel.cpp (and use GlobalISel for -O0) and weā€™d have a strong footing to THEN start adding optimization to begin to match SelectionDAG.

As well as that, Iā€™d like to see more base level test files check + compare SelectionDAG/FastISel/GlobalISel codegen - beginning with bitops/shifts for both scalar/vector would be a start, then add/sub/mul/div then control flow etc. I donā€™t like the way that so many duplicate codegen test files are appearing in the CodeGen\X86\GlobalISel subdir - that should just be for specific GISel legalization/selection pass tests.

1 Like

I used the remarks to find gaps in AArch64. It might work similarly for X86:

cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD="AArch64" -DLLVM_USE_LINKER=lld -DLLVM_ENABLE_PROJECTS="clang" -DCLANG_BOOTSTRAP_PASSTHROUGH="LLVM_TARGETS_TO_BUILD" -DCLANG_ENABLE_BOOTSTRAP=On -DBOOTSTRAP_CMAKE_CXX_FLAGS="-march=native -Rpass-missed='gisel*' -fglobal-isel -mllvm -global-isel-abort=2" ./llvm
ninja stage2

e.g.

remark: unable to legalize instruction: %98:_(<16 x s64>) = G_ZEXT %97:_(<16 x s8>) [-Rpass-missed=gisel-legalize]

Note that the size of the output type is too large for AArch64.

AArch64 has a post legalizer combiner that creates target specific pseudo instructions, e.g.:

1 Like

Here I mostly refer to AArch64.
I see that there are two separate implementations for Intrinsic::aarch64_crypto_sha1h intrinsic.
GlobalISel specific is in lib/Target/AArch64/GISel/AArch64InstructionSelector.cpp. SelectionDAG is through DAG pattern def SHA1Hrr : SHAInstSS< 0b0000, "sha1h", int_aarch64_crypto_sha1h> (more likely that I miss something and one complements another).

But yes. Thank you, I was wrong. Now I see that DAG patterns of intrinsics are finely selected. So, now itā€™s a problem of intrinsics without patternsā€¦

Iā€™ve always been of the opinion that trying to catch up to SelectionDAG is an insane task to take on initially.

Yes, sure, there are no plans to bring up everything at once. The initial plan is to gradually increase the amount of code we can correctly select with GlobalISel without any optimization.

if we can match FastISel first

I have concerns that matching FastISel is not enough. When FastISel is able to fallback to SelectionDAG for a single block, GlobalISel may fallback during final selection (after legalization and reg bank selection) and we have to redo all the work.

I donā€™t like the way that so many duplicate codegen test files are appearing in the CodeGen\X86\GlobalISel subdir - that should just be for specific GISel legalization/selection pass tests.

So, do you mean that instead of testing/checking all stages of selection we should rather add a run with GlobalISel to existing tests (letā€™s say end-to-end selection testing)?

End-to-end tests are great to show the differences in the generated code. For changes to the legalizer or select there should be mir tests, because you can test the pass in isolation.

I kind of wonder what @topperc 's opinion is here.

To me, it seems like a TON of work.

1 Like

We found by setting the register banks for intrinsics, our existing patterns just worked. We parse the Intrinsic::IITDescriptor and select the appropriate bank based on the type as a general solution which managed to get almost all of our intrinsics to work end to end.

Weā€™d then delete X86FastISel.cpp (and use GlobalISel for -O0)

Compile times would likely increase quite a bit with that. At least on AArch64, GlobalISel at -O0 is much slower (>2x) than FastISel. The IRTranslator alone takes nearly as much time as FastISel in its entirety on my (fairly limited) benchmarks. From what I gather GlobalISel is unlikely to match FastISel performance any time soon due to its multi-pass approach.

I which case it sounds like this is the area that needs work first. I canā€™t be the only one with no interest in ending up supporting 3 different selection mechanisms: xkcd: Standards

1 Like

Thatā€™s interesting - a lot of time has obviously passed since then, but the 2017 GISel slides showed much more promising performance vs FastISel. Do you have any idea if things have slowed down since then, or is this a case of different benchmarks showing different results?

Kind of late to the party, didnā€™t see the initial post!

I think this is a little bit of both but I wouldnā€™t be surprised if the biggest contributor is the fact that we didnā€™t track compile time closely. In other words, I think thereā€™s room for improvement.

FWIW, wrt FastISel, the bigger plan was to be able to generate a fast path for GISel from the TableGen patterns (selection, combines, and legalization). That plan has yet to materialize since efforts to move more stuff to TableGen for GISel have been slow.

Iā€™m happy to brainstorm on how to move forward but realistically I wonā€™t have time to contribute code that much (at least in the near future).

2 Likes