[RFC] TableGen-driven MacroFusion predicators generator

Hi all! I posted some PRs that implement TableGen-driven MacroFusion framework, please feel free to send feedbacks to me!

What’s MacroFusion

Informally, MacroFusion is a technique that multiple instructions (or uops) are fused into one in decoding phase before issuing to backend units. For example, lui+addi can be fused to one uop load i32 immediate for some RISC-V implementations. This technique is normal to increase backend thoughput for modern CPU designs.

This requires these instructions are glued together.

LLVM’s MacroFusion implmentation

Simply speaking, MacroFusion implmentation in LLVM is a DAG mutation that adds Artificial dependency to fused instruction pairs so that they will not be separated by scheduler.

For targets that support MacroFusion, they need to implement a predicator with type:


using ShouldSchedulePredTy = std::function<bool(const TargetInstrInfo &TII,
                                                const TargetSubtargetInfo &TSI,
                                                const MachineInstr *FirstMI,
                                                const MachineInstr &SecondMI)>;

And then create a MacroFusion mutation and add it to scheduler:

std::unique_ptr<ScheduleDAGMutation> llvm::createRISCVMacroFusionDAGMutation() {
  return createMacroFusionDAGMutation(shouldScheduleAdjacent);
}

ScheduleDAGInstrs *
createMachineScheduler(MachineSchedContext *C) const override {
  const RISCVSubtarget &ST = C->MF->getSubtarget<RISCVSubtarget>();
  if (ST.hasMacroFusion()) {
    ScheduleDAGMILive *DAG = createGenericSchedLive(C);
    DAG->addMutation(createRISCVMacroFusionDAGMutation());
    return DAG;
  }
  return nullptr;
}

TableGen-driven MacroFusion predicators generator

For current implementation, we need to write C++ code to test if the instruction pair can be fused. These code are usually tedious and not easy to maintain.

Another problem is that macro fusion is related to specific processor, but now we abstract macro fusions to different subtarget features for all targets. As far as I know, there may be up to 30+ instruction pairs that can be fused for some macroarchitecture (for example, , Xiangshan - Decode Unit, Chinese). If we define subtarget features for all of them, that would be a disaster. And there will be a lot of branches that are evalated to false in predicator like below:

static bool shouldScheduleAdjacent(const TargetInstrInfo &TII,
                                   const TargetSubtargetInfo &TSI,
                                   const MachineInstr *FirstMI,
                                   const MachineInstr &SecondMI) {
  const RISCVSubtarget &ST = static_cast<const RISCVSubtarget &>(TSI);

  // false if not supported.
  if (ST.hasLUIADDIFusion() && isLUIADDI(FirstMI, SecondMI))
    return true;

  return false;
}

So I am trying to introduce a TableGen-based MacroFusion framework to simplify the code and I will explain how I implement it.

TableGen Part

The TableGen part is a MacroFusion class which contains four predicates:

class MacroFusion<MCInstPredicate first, MCInstPredicate second,
                  list<MacroFusionPredicateBase> prolog = [],
                  list<MacroFusionPredicateBase> epilog = []> {
  MCInstPredicate First = first;
  MCInstPredicate Second = second;
  list<MacroFusionPredicateBase> Prolog = prolog;
  list<MacroFusionPredicateBase> Epilog = epilog;
}

first and second are predicates for FirstMI and SecondMI. They can be any predicate classes defined in TargetInstrPredicate.td.
And prolog and epilog are predicates that will be added before/after predicates for first and second. They can be any MacroFusionPredicateBase class. For example:

// Base class of MacroFusionPredicate, etc. The avaliable variables are:
// * const TargetInstrInfo &TII
// * const TargetSubtargetInfo &STI
// * const MachineRegisterInfo &MRI
// * const MachineInstr *FirstMI
// * const MachineInstr &SecondMI
class MacroFusionPredicateBase;

// MacroFusionPredicate with raw code predicate.
class MacroFusionPredicate<code pred> : MacroFusionPredicateBase {
  code Predicate = pred;
}

// Binds firstOpIdx and secondOpIdx. The operand of `FirstMI` at position
// `firstOpIdx` should be the same as the operand of `SenondMI` at position
// `secondOpIdx`.
class BindReg<int firstOpIdx, int secondOpIdx> : MacroFusionPredicateBase {
  int FirstOpIdx = firstOpIdx;
  int SecondOpIdx = secondOpIdx;
}

// A predicate for wildcard. The generated code will be like:
// ```
// if (!FirstMI)
//   return ReturnValue;
// ```
class WildcardPred<bit ret> : MacroFusionPredicateBase {
  bit ReturnValue = ret;
}
def WildcardFalse : WildcardPred<0>;
def WildcardTrue : WildcardPred<1>;

// Indicates that the destination register of `FirstMI` should be have one
// use if it is an virtual register.
class OneUsePred : MacroFusionPredicateBase;
def OneUse : OneUsePred;

The generated predicator will be like:

bool isNAME(const TargetInstrInfo &TII,
            const TargetSubtargetInfo &STI,
            const MachineInstr *FirstMI,
            const MachineInstr &SecondMI) {
  auto &MRI = SecondMI.getMF()->getRegInfo();
  /* Prolog */
  /* Predicate for `FirstMI` */
  /* Predicate for `SecondMI` */
  /* Epilog */
  return true;
}

A minimal example for RISC-V’s lui+addi fusion can be described like:

def LUIADDI : MacroFusion<CheckOpcode<[LUI]>, CheckOpcode<[ADDI]>;

The generated predicator will be like:

bool isLUIADDI(
     const TargetInstrInfo &TII,
     const TargetSubtargetInfo &STI,
     const MachineInstr *FirstMI,
     const MachineInstr &SecondMI) {
  auto &MRI = SecondMI.getMF()->getRegInfo();
  auto matchFirst = [&](const MachineInstr* MI) {
    return ( MI->getOpcode() == RISCV::LUI );
  };
  auto matchSecond = [&](const MachineInstr* MI) {
    return ( MI->getOpcode() == RISCV::ADDI );
  };
  if(!matchFirst(FirstMI))
    return false;
  if(!matchSecond(&SecondMI))
    return false;
  return true;
}

(Please see below for the whole example)

SchedMachineModel Part

A list of MacroFusion is added to SchedMachineModel to indicate supported MacroFusions:

class SchedMachineModel {
  ...
  // List of MacroFusion.
  list<MacroFusion> MacroFusions = [];
}

And a new function which returns all supported MacroFusion predicators is added to TargetSubtargetInfo:

  /// Get the list of MacroFusion predicates.
  virtual std::vector<MacroFusionPredTy> getMacroFusions() const { return {}; }

This virtual function will be overrided by TableGen-generated XXXGenSubtargetInfo and returns predicator list by SchedModel’s ID:

std::vector<MacroFusionPredTy> RISCVGenSubtargetInfo::getMacroFusions() const {
  switch(getSchedModel().getProcessorID()) {
    case 2: // SiFive7Model
      return {isLUIADDI};
  }
  return {};
}

So now we can create MacroFusion mutation like below (note that I have changed the createMacroFusionDAGMutation to accept a list of predicator pointer):

ScheduleDAGMILive *DAG = createGenericSchedLive(C);
DAG->addMutation(createMacroFusionDAGMutation(ST.getMacroFusions()));

SubtargetInfo Part

For visibility of supported MacroFusions, C++ enums of all MacroFusions are generated and new hasMacroFusion function is added to MCSubtargetInfo to test if a MacroFusion is supported. So, we don’t need to add subtarget features for macro fusions now.

#ifdef GET_MACRO_FUSION_ENUM
namespace llvm {
namespace RISCV {
enum {
  LUIADDI = 0,
};
} // end namespace RISCV
} // end namespace llvm

// Test if we support `lui+addi` fusion.
ST.hasMacroFusion(RISCV::LUIADDI);

This is implemented by adding a Bitset to MCSchedModel:

const unsigned MaxMacroFusions = 256;
using MacroFusionBitset = Bitset<MaxMacroFusions>;

struct MCSchedModel {
  ...
  const MacroFusionBitset *MacroFusionBits;
}

The hasMacroFusion is simply implemented as querying the Bitset:

  bool hasMacroFusion(unsigned MacroFusion) const {
    const MacroFusionBitset *MacroFusionBits =
        CPUSchedModel->getMacroFusionBits();
    return MacroFusionBits && MacroFusionBits->test(MacroFusion);
  }

And of cource, the Bitset is generated by TableGen (note that we can get supported MacroFusions from SchedMachineModel.MacroFusions ):

static const MacroFusionBitset XXXModelMacroFusionBits  = {
  RISCV::LUIADDI,
  ...
};

static const llvm::MCSchedModel XXXModel = {
  ...
  &XXXModelMacroFusionBits,
};

Example

Let take RISC-V’s lui+addi fusion as an example:

def LUIADDI: MacroFusion<CheckOpcode<[LUI]>,
                         CheckAll<[
                           CheckOpcode<[ADDI, ADDIW]>,
                           CheckAny<[
                            CheckIsVRegOperand<0>,
                            CheckSameRegOperand<0, 1>
                          ]>
                         ]>,
                         [WildcardTrue],
                         [OneUse, BindReg<0, 1>]>;

It generates code like:

bool isLUIADDI(
     const TargetInstrInfo &TII,
     const TargetSubtargetInfo &STI,
     const MachineInstr *FirstMI,
     const MachineInstr &SecondMI) {
  auto &MRI = SecondMI.getMF()->getRegInfo();
  auto matchFirst = [&](const MachineInstr* MI) {
    return ( MI->getOpcode() == RISCV::LUI );
  };
  auto matchSecond = [&](const MachineInstr* MI) {
    return (
    (
      MI->getOpcode() == RISCV::ADDI
      || MI->getOpcode() == RISCV::ADDIW
    )
    && (
      MI->getOperand(0).getReg().isVirtual()
      || MI->getOperand(0).getReg() == MI->getOperand(1).getReg()
    )
  );
  };
  if (!FirstMI)
    return true;
  if(!matchFirst(FirstMI))
    return false;
  if(!matchSecond(&SecondMI))
    return false;
  Register FirstDest = FirstMI->getOperand(0).getReg();
  if (FirstDest.isVirtual() && !MRI.hasOneNonDBGUse(FirstDest))
    return false;
  if (!(FirstMI->getOperand(0).isReg() &&
        SecondMI.getOperand(1).isReg() &&
        FirstMI->getOperand(0).getReg() == SecondMI.getOperand(1).isReg()))
    return false;
  return true;
}

And if we add MacroFusions to SchedMachineModel:

def TestModel : SchedMachineModel {
  let MicroOpBufferSize = 0;
  let IssueWidth = 2;
  let LoadLatency = 3;
  let MispredictPenalty = 3;
  let CompleteModel = 0;
  let MacroFusions = [LUIADDI]; // Supported macro fusions
}

It generates:


static const MacroFusionBitset TestModelMacroFusionBits  = {
  RISCV::LUIADDI
};

static const llvm::MCSchedModel TestModel = {
  2, // IssueWidth
  0, // MicroOpBufferSize
  MCSchedModel::DefaultLoopMicroOpBufferSize,
  3, // LoadLatency
  MCSchedModel::DefaultHighLatency,
  3, // MispredictPenalty
  false, // PostRAScheduler
  false, // CompleteModel
  false, // EnableIntervals
  0, // Processor ID
  TestModelProcResources,
  TestModelSchedClasses,
  10,
  4733,
  nullptr, // No Itinerary
  nullptr, // No extra processor descriptor
  &TestModelMacroFusionBits,
};

I have written this RFC in a hurry, so please feel free to discuss with me if there are any questions!

Thanks for working on this. It sounds useful to me. AMDGPU has only one fairly simple macro, which could be converted to use this: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUMacroFusion.cpp

1 Like

Instead of having a BitSet for each possible macrofusion. Would it be possible generate methods like hasLUIADDI() fusion in RISCVGenMCSubtargetInfo that checked if the processor ID of the scheduler model is one that supports that specific fusion?

For the function passed to shouldScheduleAdjacent, could tablegen make a single function that checked the processor ID and called all the macrofusion predicates that apply to that processor ID. We already have autogenerated functions like resolveSchedClass that check the processor ID.

Yes, we can do these if we don’t consider testability.
But, in order to make each macro fusion testable, we need mutable status to indicate whether we support certain macro fusions. I’m afraid that the hasLUIADDI() way and something like resolveSchedClass are not runtime mutable.
FYI, please see [TableGen] Enhance testability of TableGen-based macro fusion by wangpc-pp · Pull Request #73075 · llvm/llvm-project · GitHub