getNodePriority()

We have a function that has 256 loads and 256 fmuladds. This block of operations is bounded at either end by an OpenCL barrier (an AMDIL fence instruction). The loads and multiply/adds are initially interleaved… that is, the IR going into code generation looks like:

  %39 = load float addrspace(3)* getelementptr inbounds ([16 x [17 x float]] addrspace(3)* @sgemm.b, i32 0, i32 0, i32 0), align 4
  %40 = call float @llvm.fmuladd.f32(float %37, float %39, float %c0.037) nounwind
  %41 = load float addrspace(3)* getelementptr inbounds ([16 x [17 x float]] addrspace(3)* @sgemm.b, i32 0, i32 0, i32 1), align 4
  %42 = call float @llvm.fmuladd.f32(float %37, float %41, float %c1.036) nounwind

… and 254 more of these pairs.

%39 and %41 (and 254 more loads) are dead after they are used in the immediately following fmuladd.

RegReductionPQBase::getNodePriority() (in CodeGen/SelectionDAG/ScheduleDAGRRList.cpp) normally returns the SethiUllmanNumber for a node, but there are a few special cases. ISD::TokenFactor and ISD::CopyToReg return 0, to push them closer to their uses, and the same holds for TargetOpcode::EXTRACT_SUBREG, TargetOpcode::SUBREG_TO_REG, and TargetOpcode::INSERT_SUBREG.

There is also a special case for nodes at the beginning or end of a computational chain, based on whether the node has zero predecessors or zero successors.
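
For reference, the logic is roughly the following (paraphrased from memory rather than copied verbatim from ScheduleDAGRRList.cpp; details may differ between LLVM versions):

  // Sketch of RegReductionPQBase::getNodePriority(); not the exact upstream code.
  unsigned RegReductionPQBase::getNodePriority(const SUnit *SU) const {
    unsigned Opc = SU->getNode() ? SU->getNode()->getOpcode() : 0;
    if (Opc == ISD::TokenFactor || Opc == ISD::CopyToReg)
      return 0;                               // schedule close to uses
    if (Opc == TargetOpcode::EXTRACT_SUBREG ||
        Opc == TargetOpcode::SUBREG_TO_REG ||
        Opc == TargetOpcode::INSERT_SUBREG)
      return 0;                               // likewise for subregister ops
    if (SU->NumSuccs == 0 && SU->NumPreds != 0)
      return 0xffff;                          // treated as the end of a chain
    if (SU->NumPreds == 0 && SU->NumSuccs != 0)
      return 0;                               // treated as the beginning of a chain
    return SethiUllmanNumbers[SU->NodeNum];
  }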

Our fence instruction has 2 (constant) predecessors and no successors. This causes getNodePriority() to treat it as the end of a computational chain and return 0xffff instead of the node's normal SethiUllmanNumber, in an attempt to schedule the instruction closer to where its constants are materialized.

The result is that, coming out of code generation, the loads and fmuladds are separated… we end up with a block of 256 loads, then the fence instruction that was at the end of the block, then the 256 fmuladd operations.

This causes the live ranges of all 256 loads to increase GREATLY, driving register pressure up so much that we end up with absolutely awful performance.

We have a local quick fix for this (return the SethiUllmanNumber), but I wanted to get the advice of the list because I’d rather not have local modifications to “target independent” code generation.
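
Concretely, the hack is along these lines (a sketch only; the exact guard we use locally may differ):

  // Local hack (sketch): give chain-ending nodes their ordinary SethiUllman
  // priority instead of the 0xffff special case.
  if (SU->NumSuccs == 0 && SU->NumPreds != 0)
    return SethiUllmanNumbers[SU->NodeNum];   // was: return 0xffff;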

Also, it feels like we must be doing something wrong either in describing our target or in later code generation to get this bad a result.

Richard

> RegReductionPQBase::getNodePriority() (in CodeGen/SelectionDAG/ScheduleDAGRRList.cpp) normally returns the SethiUllmanNumber for a node, but there are a few special cases. ISD::TokenFactor and ISD::CopyToReg return 0, to push them closer to their uses, and the same holds for TargetOpcode::EXTRACT_SUBREG, TargetOpcode::SUBREG_TO_REG, and TargetOpcode::INSERT_SUBREG.
> There is also a special case for nodes at the beginning or end of a computational chain, based on whether the node has zero predecessors or zero successors.

The TargetOpcode checks are likely incorrect because they're not checking getMachineOpcode(); it's just that no one wants to change this nearly obsolete code and hunt down regressions. I would be happy to remove those checks altogether, though, if they cause problems. In your case I think it's unrelated.
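
To illustrate the distinction (this is not an actual patch, just the shape a correct check would take): the TargetOpcode values name machine opcodes, so the comparison should go through the node's machine opcode rather than its ISD opcode.

  // Sketch: query the machine opcode before comparing against TargetOpcode values.
  const SDNode *N = SU->getNode();
  if (N && N->isMachineOpcode()) {
    unsigned MOpc = N->getMachineOpcode();
    if (MOpc == TargetOpcode::EXTRACT_SUBREG ||
        MOpc == TargetOpcode::SUBREG_TO_REG ||
        MOpc == TargetOpcode::INSERT_SUBREG)
      return 0;                               // keep subregister copies near their uses
  }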

> We have a local quick fix for this (return the SethiUllmanNumber), but I wanted to get the advice of the list because I’d rather not have local modifications to “target independent” code generation.
> Also, it feels like we must be doing something wrong either in describing our target or in later code generation to get this bad a result.

As we discussed off-list, please use -pre-RA-sched=source if possible, and introduce target-specific scheduling in the MachineScheduler pass. There are multiple ways to "plug in" to MachineScheduler.
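
For example, one of those ways (a very rough sketch; the strategy name is invented and the constructor signatures have changed across LLVM versions) is to register a target-specific strategy through MachineSchedRegistry:

  // Hypothetical registration of a custom MachineScheduler strategy.
  // "AMDILSchedStrategy" is a made-up name for illustration only.
  static ScheduleDAGInstrs *createAMDILMachineSched(MachineSchedContext *C) {
    return new ScheduleDAGMI(C, new AMDILSchedStrategy());
  }
  static MachineSchedRegistry
  AMDILSchedRegistry("amdil-misched", "Custom AMDIL scheduling strategy",
                     createAMDILMachineSched);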

-pre-RA-sched=source is currently being fixed to work as advertised. A patch is in the works; expect to see it posted fairly soon. It's still usable as-is, but doesn't always preserve ordering.

-Andy