[RFC] Add TargetBuiltins for SPIRV to support HLSL

This isn’t an entirely new concept for clang/lib/Basic/Targets/SPIR.cpp since SPIRV64AMDGCNTargetInfo exposes AMDGPU builtins.
What I would like to do is to add a BuiltinsSPIRV.td to be used by SPIRVTargetInfo. What is unique about the builtins I want to add is that they are HLSL lang builtins that are only needed for the SPIRV backend.

Therefore, instead of using a SPIRV-based naming pattern like __builtin_spirv_distance, I would like to add them as __builtin_hlsl_distance. They will be used to bypass a generic implementation via __has_builtin, like so:

#if (__has_builtin(__builtin_hlsl_distance))
  return __builtin_hlsl_distance(X, Y);
#endif
  return length_vec_impl(X - Y);
}

Using a lang builtin naming scheme means we don’t have to introduce any target architecture specific macros to our hlsl headers.

Example commit of what this would look like: [HLSL] Implement a header only distance intrinsic by farzonl · Pull Request #117240 · llvm/llvm-project · GitHub

I find language builtins really distasteful. They unnecessarily bring the language into something that isn't language specific (e.g. see the mess of duplication we've ended up with between OpenCL, CUDA, HIP and OpenMP, which deviate only by the language prefix). You're exporting a spirv intrinsic; it should just be a spirv builtin usable from any language.

Also I really encourage not having intrinsics at all. For something like this, is there a reason you can’t just treat this as an optimization pattern in the backend?


You’re exporting a spirv intrinsic, it should just be a spirv builtin usable from any language.

Doing it this way means we need to introduce arch-specific macros into our headers, which was something we wanted to avoid. Moving that logic to the TargetBuiltin was our way of abstracting away that requirement.

is there a reason you can’t just treat this as an optimization pattern in the backend?

Are you asking about an optimization pattern for the builtin, or for the IR we are generating in the generic case? I'm going to assume the generic case.
The problem with pattern matching the IR below:

%sub.i = fsub <4 x float> %X, %Y
%mul.i = fmul <4 x float> %sub.i, %sub.i
%rdx.fadd.i = tail call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> %mul.i)
%0 = tail call noundef float @llvm.sqrt.f32(float %rdx.fadd.i)

is that since length and distance share the same code except for the fsub, the pattern match is order dependent, which isn't ideal.

Also, the SPIRV backend doesn't have a SPIRVGIsel.td, so there is no tablegen matcher to leverage `def : Pat<` for this case. I would have to write the pattern matcher in the instruction selector on a case-by-case basis. Below is an example of what that would look like. It ends up being about 100 lines of unpleasant code, especially because you have to account for G_INTRINSIC_W_SIDE_EFFECTS wrappings of the inputs and outputs.

bool SPIRVInstructionSelector::spvDistancePatternMatch(Register ResVReg,
                                                       const SPIRVType *ResType,
                                                       MachineInstr &I) const {
  if (I.getOpcode() != TargetOpcode::G_FSQRT)
    return false;
  MachineRegisterInfo &MRI = I.getMF()->getRegInfo();
  Register AssignTypeReg = I.getOperand(1).getReg();
  MachineInstr *AssignTypeInst = MRI.getVRegDef(AssignTypeReg);
  if (!AssignTypeInst ||
      AssignTypeInst->getDesc().getOpcode() != SPIRV::ASSIGN_TYPE)
    return false;
  Register DotReg = AssignTypeInst->getOperand(1).getReg();
  MachineInstr *DotInstr = MRI.getVRegDef(DotReg);
  if (!DotInstr || DotInstr->getOpcode() != TargetOpcode::G_INTRINSIC ||
      cast<GIntrinsic>(DotInstr)->getIntrinsicID() != Intrinsic::spv_fdot)
    return false;

  // DotInstr first operand is `G_INTRINSIC` so start at operand 2.
  Register Sub1AssignTypeReg = DotInstr->getOperand(2).getReg();
  MachineInstr *Sub1AssignTypeInst = MRI.getVRegDef(Sub1AssignTypeReg);
  if (!Sub1AssignTypeInst ||
      Sub1AssignTypeInst->getDesc().getOpcode() != SPIRV::ASSIGN_TYPE)
    return false;

  Register Sub2AssignTypeReg = DotInstr->getOperand(3).getReg();
  MachineInstr *Sub2AssignTypeInst = MRI.getVRegDef(Sub2AssignTypeReg);
  if (!Sub2AssignTypeInst ||
      Sub2AssignTypeInst->getDesc().getOpcode() != SPIRV::ASSIGN_TYPE)
    return false;

  Register SubReg1 = Sub1AssignTypeInst->getOperand(1).getReg();
  Register SubReg2 = Sub2AssignTypeInst->getOperand(1).getReg();
  MachineInstr *SubInstr1 = MRI.getVRegDef(SubReg1);
  MachineInstr *SubInstr2 = MRI.getVRegDef(SubReg2);
  if (!SubInstr1 || !SubInstr2 ||
      SubInstr1->getOpcode() != TargetOpcode::G_FSUB ||
      SubInstr2->getOpcode() != TargetOpcode::G_FSUB)
    return false;

  if (SubInstr1->getOperand(1).getReg() != SubInstr2->getOperand(1).getReg() ||
      SubInstr1->getOperand(2).getReg() != SubInstr2->getOperand(2).getReg())
    return false;

  // selectExtInst(ResVReg, ResType, I, CL::distance, GL::Distance);
  ExtInstList ExtInsts = {{SPIRV::InstructionSet::OpenCL_std, CL::distance},
                          {SPIRV::InstructionSet::GLSL_std_450, GL::Distance}};
  MachineBasicBlock &BB = *I.getParent();
  for (const auto &Ex : ExtInsts) {
    SPIRV::InstructionSet::InstructionSet Set = Ex.first;
    uint32_t Opcode = Ex.second;
    if (STI.canUseExtInstSet(Set)) {
      MachineInstrBuilder NewInstr =
          BuildMI(BB, I, I.getDebugLoc(), TII.get(SPIRV::OpExtInst));
      NewInstr.addDef(ResVReg)
          .addUse(GR.getSPIRVTypeID(ResType))
          .addImm(static_cast<uint32_t>(Set))
          .addImm(Opcode)
          .addUse(SubInstr1->getOperand(1).getReg())
          .addUse(SubInstr1->getOperand(2).getReg())
          .constrainAllUses(TII, TRI, RBI);
    }
  }

  if (I.getNumDefs() > 0) { // Make all vregs 64 bits (for SPIR-V IDs).
    for (unsigned Index = 0; Index < I.getNumDefs(); ++Index)
      MRI.setType(I.getOperand(Index).getReg(), LLT::scalar(64));
  }

  // The G_INTRINSIC for `fdot` has `G_INTRINSIC_W_SIDE_EFFECTS` instructions
  // using its argument and return registers. Remove those uses.
  // Use make_early_inc_range so erasing doesn't invalidate the iterator.
  auto RemoveAllUses = [&](Register Reg) {
    for (MachineInstr &UseMI :
         llvm::make_early_inc_range(MRI.use_instructions(Reg)))
      UseMI.eraseFromParent();
  };

  // Clean up: erase the original instructions
  I.eraseFromParent();               // Remove FSQRT
  AssignTypeInst->eraseFromParent(); // Remove DOT ASSIGN
  RemoveAllUses(AssignTypeReg);      // Remove intrinsic return side effect uses
  DotInstr->eraseFromParent();       // Remove spv.fdot
  Sub1AssignTypeInst->eraseFromParent();
  RemoveAllUses(Sub1AssignTypeReg); // Remove intrinsic argument side effect
                                    // uses
  SubInstr1->eraseFromParent(); // Remove FSUB
  if (SubInstr1 != SubInstr2) {
    Sub2AssignTypeInst->eraseFromParent();
    RemoveAllUses(
        Sub2AssignTypeReg);       // Remove intrinsic argument side effect uses
    SubInstr2->eraseFromParent(); // Remove FSUB
  }
  return true;
}
bool SPIRVInstructionSelector::spvPatternMatch(Register ResVReg,
                                               const SPIRVType *ResType,
                                               MachineInstr &I) const {

  if (spvDistancePatternMatch(ResVReg, ResType, I))
    return true;
  return false;
}

To expand a little further on why a LangBuiltin and not a normal TargetBuiltin:

A __builtin_spirv_distance would need its own SemaSPIRV.cpp, which doesn't exist today. Say some new graphics backend comes along that also has a distance opcode and we want to bring HLSL to it; that backend would also need its own Sema<Target>.cpp, and that could lead to divergent semantic checks and codegen, since there is a similar pattern in CGBuiltin.cpp. Keeping it as a LangBuiltin, the sema checks all live in the same place and invoking them is just:

 case <NEWGPUIR>::BI__builtin_hlsl_*:
 case NVPTX::BI__builtin_hlsl_*:
 case AMDGPU::BI__builtin_hlsl_*:
 case SPIRV::BI__builtin_hlsl_*: {
...
}

Same with code in clang/lib/CodeGen/CGBuiltin.cpp. You don’t have to split apart code that is 95% the same.

I don’t see the issue. This is just an ordinary optimization problem that the backend is supposed to be good at.

Your argument is essentially SPIRV has missing optimizations and the APIs are bad, so the alternative is to make it a user problem. The backend should be doing this kind of basic optimization work anyway. It doesn’t require expanding the support surface area of the compiler. The APIs need work, but this pattern looks totally handleable from tablegen patterns.

I think trying to turn optimizations into semantics is generally bad. I think this type of pattern should just be matched as an optimization, without requiring new backend support for new operations. If you really really want a source level builtin, it should be a target and language independent builtin.

I’m sufficiently convinced you are right here. Moved to be language independent in this commit: [HLSL] Implement a header only distance intrinsic by farzonl · Pull Request #117240 · llvm/llvm-project · GitHub

This is a great question for the SPIRV backend owners @michalpaszkowski, @Keenuts, @VyacheslavLevytskyy, @s-perron, what is your preference here? Do you want SPIRV extensions for OpenCL and GLSL exposed to the clang frontend via Target Builtins or should the SPIRV backend invest in a pattern matcher?

From a general point of view, I do not usually expect pattern matching to work if the pattern is too complex. The front-end and optimizer could produce so many different patterns, and matching all of them becomes difficult.

In this case, I could easily see the fsub and fmul getting folded into some other arithmetic operations, so I would expect it to fail to match the pattern most of the time.

However, we have to ask how valuable this optimization is. There is no guarantee that the GPU has a distance instruction. Also, if the fsub and fmul can be optimized together with other arithmetic code, then that could be even faster.

In this case, I would say do a general expansion without any backend intrinsic. If we notice it causes performance problems, we can optimize it. I think it might be fair to leave the optimization to the driver.

Note: LLPC seems to implement the GLSL distance function as a sqrt followed by a dot product.
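For concreteness, the general expansion suggested above can be sketched in scalar C++ (a stand-in for the float4 HLSL case; `distance4` is an illustrative name, not an existing API):

```cpp
#include <cmath>

// Sketch of the generic expansion: distance(X, Y) computed as the square
// root of the dot product of (X - Y) with itself, with no backend intrinsic.
// Scalar loop stand-in for the <4 x float> IR shown earlier in the thread.
float distance4(const float X[4], const float Y[4]) {
  float Acc = 0.0f;
  for (int I = 0; I < 4; ++I) {
    float D = X[I] - Y[I]; // the fsub that distinguishes distance from length
    Acc += D * D;          // the fmul + vector.reduce.fadd in the IR
  }
  return std::sqrt(Acc);
}
```

Note that dropping the fsub (i.e. Y = 0) yields exactly the length computation, which is why the two patterns are so easy for a matcher to confuse.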

My first version of this was a dot-product-specific implementation, but our team decided that the HLSL frontend implementation should match the DXC-generated IR to avoid any rounding-specific differences. So, short of a target builtin, our version would need pattern matching to take advantage of dot-product-specific hardware optimizations.

One thing to keep in mind: no target builtin and no pattern matching also means that DXC and clang-dxc, when targeting SPIRV, would have different codegen, something I was hoping to avoid.

Hello!

For the language intrinsic part, I agree with Matt; I'd be in favor of only exposing target-specific instructions using builtins, but not the other way around.

My first version of this was a dot product specific implementation, but our team decided that the HLSL frontend implementation should match the DXC generated IR to avoid any rounding specific differences.

Isn’t that a codegen/optimization issue? I’d assume that if you generate the same operations in the FE as DXC, optimizations shouldn’t be allowed to do any reordering/change which impacts the float computations unless allowed by something like fast math?

So short of a target builtin our version would need pattern matching to take advantage of dot product specific hardware optimizations.

Unless I’m missing something, I’d rather do pattern matching in the backend vs. having a language builtin (even if this means writing something specific to deal with SPIR-V assign_type and other annotations).
If we do a language builtin, sure, we’d be able to optimize the case where the user uses the HLSL distance function, but wouldn’t we miss cases where the user’s code manually computes a vector length?

No, it’s actually the opposite. The testing I’ve done with pattern matching shows that the distance and length GLSL ops are really easy to confuse in the pattern matching, especially if the subtraction gets optimized away earlier.

Builtins for all GLSL/OpenCL ext ops would allow us to emit the ops and not worry about the pattern missing. This is actually why I changed the HLSL length intrinsic in the example PR: distance(X, Y) defined as length(X - Y) would emit the GLSL length op, which would be confusing.

I think this is one of the reasons why OpenCL exposes the OpenCL lang builtin versions of these extensions in clang/lib/Sema/OpenCLBuiltins.td.

In the past we have decided that this should be a decision left to the driver. We expect that Intel/Nvidia/AMD have their own pattern matching on the generated DXIL. Changing our codegen could break those patterns and delay adoption, so we are going to be intentional about keeping what we can the same until we have evidence to justify an optimization. That also goes for the SPIRV backend, and it feeds into why there are concerns about pattern matching failing.

This is just a missing pattern. Optimizations should not be mandated semantics.

This is an implementation detail of providing all of the standard declarations for the functions without requiring textual parse of a header. This has nothing to do with pattern matching or optimization.

This is a backwards way of thinking about it. By expanding the set of operations, you’ve mandated an increase in the complexity of the backend. Mandating specific optimizations is an anti-pattern.

I think this is the fundamental disconnect. I don’t see these as optimizations. Sure, perf might be worse if pattern matching fails, but I see this as a correctness problem. DXC uses the SPIRV GLSL extensions. Many of the shader transpilers also do these same transformations of HLSL APIs to GLSL extensions. I have concerns that if pattern matching fails, the burden of correctness moves from the graphics driver to the HLSL compiler.

It can never be wrong to expand an operation in terms of more primitive operations. The set of extensions that exist in X or Y place does not matter; you cannot be required to use them.

I don’t mean this in terms of whether the transformation is valid, just in terms of being consistent with other implementations of the language. I want this consistency so the precision matches across implementations. An example might help here. Below are three ways to get length in a shader targeting spirv: calling the HLSL length intrinsic, sqrt(dot(a, a)), and a loop summation. The first two produce the same result, but the summation version starts to diverge from the other two around the 9th decimal place. I don’t know if that type of difference will be important in the long run, but it is one of those things that, if we exposed the spirv extensions to the frontend, we could let the GPU driver figure out instead of the language.

void printMain()
{
    float4 a = float4(-0.001840781536884605884552,  0.033134069293737411499023,  0.988499701023101806640625,  1.123456789987654321123456);
    
    float aLength = length(a);
    float bLength = 0;
    float4 bsquared = a*a;
    for(int I = 0; I < 4; I++) {
        bLength += bsquared[I];
    }
    bLength = sqrt(bLength);
    float cLength = sqrt(dot(a,a));
    print("a vector: %f,%f,%f,%f\n", a[0],a[1],a[2],a[3]);
    print("a length via length function: %f\n", aLength);
    print("a length via loop summation : %f\n", bLength);
    print("a length via dot product    : %f", cLength);
}
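The divergence of the loop-summation version comes down to floating-point addition not being associative. A minimal, self-contained C++ illustration of the same effect (values chosen to make the rounding visible at float precision):

```cpp
// Float addition is not associative: summing the same three terms in two
// different orders gives different results. Near 1e8 the spacing between
// consecutive floats is 8, so -1e8f + 1.0f rounds back to -1e8f and the
// 1.0f is lost entirely in the second ordering.
float sumLargeFirst(float A, float B, float C) { return (A + B) + C; }
float sumSmallLast(float A, float B, float C) { return A + (B + C); }
```

This is why reordering a reduction (as a matched instruction or a driver optimization might do) can change the low bits of the result unless fast math permits it.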

If you don’t get the same result in the fused version, it’s not a valid transformation (without fast math). If it’s permissible to produce different results without fast math on different targets, the operation isn’t really well formed. For generic code to operate on generic operations, there needs to be a fixed behavior.

Thank you for all the feedback, especially from you, Matt! I think it makes the most sense to table this plan for now. I have some things to consider. Maybe it makes the most sense to do sqrt(dot(a, a)) in the frontend; then it shouldn’t matter whether we emit the OpenGL extension or not. Or maybe it makes the most sense to do nothing here and formalize the DirectX implementation as the standard, so that the precision behavior is the same between the DirectX and SPIRV backends. There are enough questions here that I think we will have to revisit this after the holidays.

I also just confirmed that with fast math on, the behavior is the same across all three implementations. Our plan for the clang-dxc driver is for fast math to be on by default soon, so my concerns about precision might be resolved shortly anyway.

That said, there seems to be one area of agreement which I would like to hear from the SPIRV backend owners @michalpaszkowski and @VyacheslavLevytskyy on. Do you two even want target builtins for spirv? If it’s a no from you two, then that will give clarity on how to proceed.

consensus area 1: add target builtins

I didn’t find much difference in the points from Matt, Steven, and Nathan, so I abstained from adding one more identical opinion. It looks like there is already some kind of consensus. I don’t like the idea of language builtins and intrinsics either, and a possible intent to match DXC implementation details doesn’t sound like a strong argument to me. Both (1) optimization/pattern matching and (2) target (language-independent) builtins look eligible and feasible.

Personally, I regard builtins as more of a last resort in this case. I’d start with the general expansion anyway and analyze performance, hoping that the driver can cope with the problem. If it arguably and measurably cannot, an optimization pass is a more general (and hopefully reusable) solution. I’m OK with builtins as a manual optimization; I just doubt that builtins should be the starting point. In other words, I’m not against builtins, but reusable features excite me much more than yet another special case.

Expressing this in SPIR-V Backend specific terms, I’d prefer a way where we invest into general SPIR-V Backend features, reusability and quality, including better pattern matching. We’ve been rewriting and improving that part of the business logic during the year, it makes sense to proceed.


This sounds good. My plan will be to put in the infrastructure to do both pattern matching and target builtins. Further, we will try to make pattern matching the first choice for expansion. I don’t have any hard-set rules yet for picking one or the other, so it will be a bit of trial and error to start.

There are a bunch of annoying setup pieces to turn on target builtins, but I think setting up both builtins and pattern matching gives us flexibility if one makes more sense for a given use case.

In the interim it also seems perfectly valid to do nothing for the spirv case, which is how I will start defining the HLSL API tickets.

Also I’ve already abandoned the idea of a lang builtin for this.


Another side note: OpenCL’s vector length function is defined to avoid overflow and underflow, so it isn’t the simple vector reduction + sqrt; it includes some range checks with scaling.
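To illustrate the kind of range handling being referred to, here is a hedged two-component sketch of the scale-before-squaring technique (an illustration of the general approach, not OpenCL's actual library code; `safeLength2` and `naiveLength2` are made-up names):

```cpp
#include <cmath>

// Naive length overflows when a component's square exceeds FLT_MAX.
float naiveLength2(float X, float Y) { return std::sqrt(X * X + Y * Y); }

// Overflow/underflow-safe variant: divide by the largest magnitude first,
// so every squared term is <= 1, then scale the result back up.
float safeLength2(float X, float Y) {
  float M = std::fmax(std::fabs(X), std::fabs(Y));
  if (M == 0.0f)
    return 0.0f;
  float SX = X / M, SY = Y / M;
  return M * std::sqrt(SX * SX + SY * SY);
}
```

For components around 1e20, the naive form produces infinity while the scaled form stays finite, which is the behavior difference the spec language is guarding against.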