If you ever find one, let me know!
I would like to try and address this, but admittedly I am a newcomer to MLIR and may have some wires crossed, so please go easy on me.
Currently the GEN dialect is being used as part of the lowering pipeline in OpenAI Triton to allow the Triton language to support Intel GPUs. The comparison to NVVM comes from there, but is also structural - many of the ops are identical between NVVM and GEN. However, unlike NVVM where the lowering to LLVM generates intrinsics specific to the PTX backend, GEN typically outputs an OpenCL function call which is left as an external function call in the generated SPIRV kernel and handled by the Intel Graphics Compiler (IGC) downstream (either resolved or replaced with equivalent functionality). So, I think you can make the comparison both ways but I understand your argument and wanted to unpack it a bit more, if nothing else for my own understanding of how the MLIR community expects dialects to fit together.
The compilation flow for Intel GPUs (at least when LLVM is involved, in our experience) is to generate SPIRV kernels and then compile those kernels using some wrapper software package around the IGC, like the Level0 driver or SYCL. IGC is designed to be an OpenCL compiler, so a lot of extensions are represented by OpenCL functions - though we understand custom SPIRV extensions are also possible. IGC has a bunch of builtins, and many if not all of the functionality encapsulated in the GEN dialect could be represented with an IGC builtin (which just takes us conceptually lower in the IGC pass pipeline), but the IGC team makes no stability or portability guarantees for the builtins and they are not to be used externally. To summarize - we have LLVM IR → SPIRV w/ OpenCL function calls → Intel Graphics Compiler (SPIRV with those calls resolved and many optimizations applied) → GPU module/kernel (see https://spec.oneapi.io/level-zero/latest/core/PROG.html#modules for more details).
In a perfect world it seems like a native LLVM backend (probably roughly encapsulating IGC) with the IGC builtins represented as intrinsics would give perfect parity with the NVVM dialect and make GEN a no-brainer inclusion (provided of course we used the intrinsics and not OpenCL function calls, though perhaps the difference is arbitrary?). But, we are hemmed in by both lower level constraints (the rigid L0/IGC software stack not being a part of upstream LLVM right now) and upper level ones (to fit the existing Triton flow, we need a dialect that is within the existing LLVMIR dialect so we can map NVVM calls to our own GPU functions within their codebase). I am interested in ways to incrementally improve the situation, while also being mindful of the necessity to make progress using the work we already have provided it does not irrecoverably cause harm.
I don't think a native LLVM backend for Intel GPU is tenable right now given existing constraints. We had previously tried creating lowerings to SPIRV for Triton's internal LLVMIR-based dialects, but that proved incredibly difficult both to get the lowering right (you really just want to use the SPIRV backend in LLVM or the SPIRV translator, otherwise you're attempting to short-circuit or redo a lot of difficult existing work) and to keep in sync with upstream, so I don't think that approach is tenable either. We have discussed a more generic Intel GPU dialect, preferably with its own set of intrinsics (e.g. SPIRV builtins) or function calls (e.g. OpenCL) mapped to an existing LLVM backend (like SPIRV), but figuring out where this dialect would sit is tricky. For current use cases it just becomes additional indirection; but it could improve readability and possibly compose better with other dialects (like SPIRV) for use cases outside of OpenAI's Triton implementation where LLVMIR lowering is not required.
I am not sure if we're talking about the same thing, but a "native backend" would mean to me that you would open-source the IGC SPIRV compiler and it would target the actual ISA and not SPIRV.
Basically that means the implementation of a SPIRV compiler and not a compiler outputting SPIRV, as I described before for where AMD/NVIDIA are:
All the considerations about OAI Triton, etc. are only tangential to the discussion; it's a useful reference point, but very minor overall when it comes to discussing "how do we envision the stack when we build a compiler targeting SPIRV when we come from higher-level MLIR dialects".
You're presenting this as a constraint; I disagree with this characterization. It is mostly a minor convenience, and you could target the SPIRV dialect directly from the TritonGPU IR.
For all the arithmetic and vector instructions, these would be provided by MLIR already. For GEN, well, it's not more work: you have to write the conversion already!
For Triton-specific constructs that directly target LLVM and aren't using upstream lowering, you'd need to implement the SPIR-V lowering, and that's extra work, but it seems quite bounded to me. How many such ops are there? How much work is this, really?
So overall, what I would characterize as "saving a little bit of effort in Triton" isn't a compelling argument to me for any important MLIR upstream consideration.
All the considerations about OAI Triton, etc. are only tangential to the discussion; it's a useful reference point, but very minor overall when it comes to discussing "how do we envision the stack when we build a compiler targeting SPIRV when we come from higher-level MLIR dialects".
While OAI Triton might only be a tangential example, I think it points to a bigger point here:
Compilation flows targeting Nvidia or AMD GPUs through MLIR will eventually always target the LLVM dialect, as it seems unlikely someone would build a native backend for those targets (PTX/GCN) in MLIR at this point.
So such compilation flows will always have a lowering to the LLVM dialect, potentially also a direct lowering to LLVM dialect for custom dialects in those flows.
Forcing the same compilation flow to go through the SPIR-V dialect when targeting Intel GPU means that all lowerings for the custom dialect that directly target the LLVM dialect would need to be rewritten for the SPIR-V dialect and no reuse would be possible.
While I understand your motivation with regard to the SPIR-V dialect, IMHO MLIR should not force users to choose a route that hinders reusability of lowerings etc.
I think MLIR should not force compilation flows to go down a specific, different route for GPUs of one vendor when compared to GPUs of other vendors but should give them choice. In particular if targeting SPIR-V through LLVM would also benefit other, non-MLIR-based compilation flows.
I agree. I don't know if there is a term for this, but I call it the OSS standardization tax: you buy into a standard for your business, which is good. Kudos. But then it increases the surface area of conversations you have to have and people you have to convince in OSS projects regarding overlap in how the standard is leveraged. The nature of systems development means that there is almost always an early adopter of a standards-based technology in a software project, and that can set the tone for how it is used long before more varied cases show up. And now you're not just paying whatever overhead comes from using the standard but also having a hard time integrating if your case isn't exactly what came before (or has practical concerns that force it down a different path, even if just for a while).
In this case (again, as a sponsor of some of the in-tree SPIR-V tooling), I think some of that may help you at some point, but it is your product and your choice. At this point in time, I pattern match that Intel built their GPU backend using SPIR-V technology, but I fully understand wanting to be on the majority path with the implementation. That's great and enough. Maybe the example will attract others to the approach and in the future we can see our way to a bit less fragmentation and more common tooling flows. But these things play out over many years. I don't see a need to converge right now on how it is implemented with respect to the ideal of a high-level SPIR-V based compiler.
Not sure if that makes sense. But I'm a strong proponent of standards-based solutions in this space, and I'm happy if you got halfway, and I think it is perfectly reasonable to look at it currently like its in-class peers and focus on a similar tooling flow. You can't get everywhere all at once and have to prioritize.
2 points:
- I believe I addressed this already: please re-read the end of my previous post you're quoting here.
- The amount of reuse is really just pointing at missing abstractions where OAI Triton (and other MLIR-based compilers) are overly fitted to LLVM, which is exactly what "developing a native SPIRV target in MLIR" would contribute to address!
Part of this exists already: for example, Triton uses the arithmetic dialects and the builtin tensor, vector, and integer types; these should already be retargetable to SPIRV. Anything else is just "TODO". MLIR is built incrementally when the need arises: what we find here is the first case of hitting the missing abstractions.
I understand why in your position you can see it from this angle, but from a project point of view I disagree with this statement: we may or may not want to develop the SPIRV native path, but this is a choice we have here, and "different route for GPU" is definitely something that can be justified (I think I elaborated enough on this).
Thanks for your feedback Mehdi!
Based on this and in order to make some progress on the topic, maybe the position of the GEN dialect needs to be reconsidered.
Instead of binding the GEN dialect to the LLVM dialect and making it an extension of the LLVM dialect as initially proposed in this RFC, the GEN dialect could be a separate dialect.
As such, it would be in a similar position as the arithmetic dialects and close the abstraction gap mentioned in the quote above. It would expose functionality of a particular hardware (Intel GEN GPU) as operations in the dialect, and those capabilities can be targeted through LLVM as well as SPIR-V.
Instead of a translation of the operations to LLVM IR, the re-designed GEN dialect would come with lowerings to LLVM and SPIR-V dialect, similar to the arithmetic dialects.
This way, we can avoid overly fitting GEN to LLVM, and we would ensure that the SPIR-V native path remains intact and can be targeted through the GEN dialect.
What do you (and of course all others!) think about this approach?
If this were possible, that would seem ideal to me. Have you looked into the feasibility already?
I first wanted to see whether this approach would be agreeable in general.
We don't expect any major issues, but we'll do some experimentation with this approach and then return to this conversation here with the result of the experimentation, so there would also be some code to look at.
Based on the great feedback we have received in this conversation and the ODM, we have changed the design of the GEN dialect significantly.
An important concern was that the originally proposed design for GEN could not lower to the SPIR-V dialect and that the route via SPIR-V dialect should be kept intact for targeting Intel GPU.
We have been working for the past few weeks to get a new version of the GEN dialect addressing these concerns. The new GEN dialect is a standalone dialect, no longer dependent on LLVM, that can be converted both to the LLVM dialect and also to the SPIR-V dialect now.
To demonstrate the feasibility of such a dialect that abstracts above these two leaf dialects (similar to arith or the ptr dialects) and provides lowerings to both, we have defined a minimal set of operations to start working on upstreaming GEN.
The operations focus on determining the position of a work-item (=thread) in the execution grid.
As a proof of concept, we are going to show two different compilation paths for the gen.local_id operation from MLIR GEN dialect to SPIR-V.
First, the gen.local_id operation is used to query the local id (corresponding to gpu.thread_id) along a given dimension, which is passed as an argument of type i32. This and similar operations are defined to return index, as both spirv32 and spirv64 targets, i.e., targets with Physical32 and Physical64 memory models, should be supported.
As a motivating example, we will use the following func.func function returning the local id.
func.func @get_local_id() -> index {
%c0_i32 = arith.constant 0 : i32
%0 = gen.local_id %c0_i32
func.return %0 : index
}
LLVM Compilation Path
For the compilation path via the LLVM SPIR-V backend, we first show how the above code can be represented solely in the LLVM dialect. This can be achieved by running the following pipeline:
mlir-opt input.mlir --pass-pipeline="builtin.module(convert-gen-to-llvm,convert-func-to-llvm,convert-arith-to-llvm,canonicalize)"
The gen.local_id operation is represented in LLVM as a call to the get_local_id builtin (also used by clang and defined here) receiving an i32 input and returning i32 for spirv32 targets and i64 for spirv64 targets. We will assume a 64-bit target in this example:
llvm.func spir_funccc @_Z12get_local_idj(i32) -> i64
llvm.func @get_local_id() -> i64 {
%0 = llvm.mlir.constant(0 : i32) : i32
%1 = llvm.call @_Z12get_local_idj(%0) : (i32) -> i64
llvm.return %1 : i64
}
As usual, we can translate this to LLVM IR as follows:
mlir-translate --mlir-to-llvmir
Obtaining a simple LLVM IR module:
source_filename = "LLVMDialectModule"
declare spir_func i64 @_Z12get_local_idj(i32)
define i64 @get_local_id() {
%1 = call i64 @_Z12get_local_idj(i32 0)
ret i64 %1
}
!llvm.module.flags = !{!0}
!0 = !{i32 2, !"Debug Info Version", i32 3}
Which we can use to generate a SPIR-V module using LLVM's SPIR-V backend:
llc -mtriple=spirv64
Here we can see how the backend handles the get_local_id builtin call, creating a __spirv_BuiltInLocalInvocationId global variable of type <3xi64> (64-bit elements, as this is a 64-bit target) with Input storage class and Import linkage, corresponding to the LocalInvocationId SPIR-V builtin. Then, the call itself is replaced by an OpLoad operation loading the global variable and an OpCompositeExtract operation extracting the required dimension of the vector.
OpCapability Kernel
OpCapability Addresses
OpCapability Int64
OpCapability Linkage
%1 = OpExtInstImport "OpenCL.std"
OpMemoryModel Physical64 OpenCL
OpSource Unknown 0
OpName %8 "get_local_id"
OpName %7 "__spirv_BuiltInLocalInvocationId"
OpDecorate %8 LinkageAttributes "get_local_id" Export
OpDecorate %7 Constant
OpDecorate %7 LinkageAttributes "__spirv_BuiltInLocalInvocationId" Import
OpDecorate %7 BuiltIn LocalInvocationId
%2 = OpTypeInt 64 0
%3 = OpTypeFunction %2
%4 = OpTypeInt 32 0
%5 = OpTypeVector %2 3
%6 = OpTypePointer Input %5
%7 = OpVariable %6 Input
%8 = OpFunction %2 None %3 ; -- Begin function get_local_id
%11 = OpLabel
%9 = OpLoad %5 %7 Aligned 1
%10 = OpCompositeExtract %2 %9 0
OpReturnValue %10
OpFunctionEnd
SPIR-V Dialect Compilation Path
For the compilation path via the SPIR-V dialect, we are also going to show the outcome at different compilation stages.
First, let's see the result of running the following pipeline (wrapping the resulting module in a SPIR-V module for future handling):
mlir-opt input.mlir --pass-pipeline="builtin.module(convert-gen-to-spirv,convert-func-to-spirv,convert-arith-to-spirv,canonicalize)"
As easy as that, our gen.local_id operation lowers to a spirv.VectorExtractDynamic operation receiving the i32 dimension and the previously loaded vector<3xi32> global variable representing the LocalInvocationId builtin. To keep the conversion pass simple, we use spirv.VectorExtractDynamic for now; it has semantics equivalent to the OpCompositeExtract operation above (which requires constant indices) while also allowing variable indices. Mangling of this builtin symbol has been configured to match what we obtain above, as it is not mandated by any SPIR-V standard. As we can see here, we use i32 for the operation result type because our target is 32-bit, as mandated by the Physical32 addressing model:
spirv.module @__spv__kernels Physical32 OpenCL requires #spirv.vce<v1.0, [Addresses, Kernel], []> attributes {spirv.target_env = #spirv.target_env<#spirv.vce<v1.2, [Addresses, Kernel], []>, #spirv.resource_limits<>>} {
spirv.GlobalVariable @__spirv_BuiltInLocalInvocationId built_in("LocalInvocationId") : !spirv.ptr<vector<3xi32>, Input>
spirv.func @get_local_id() -> i32 "None" {
%cst0_i32 = spirv.Constant 0 : i32
%__spirv_BuiltInLocalInvocationId_addr = spirv.mlir.addressof @__spirv_BuiltInLocalInvocationId : !spirv.ptr<vector<3xi32>, Input>
%0 = spirv.Load "Input" %__spirv_BuiltInLocalInvocationId_addr : vector<3xi32>
%1 = spirv.VectorExtractDynamic %0[%cst0_i32] : vector<3xi32>, i32
spirv.ReturnValue %1 : i32
}
}
If we translate this to SPIR-V using mlir-translate as usual (and spirv-dis to obtain the textual representation):
mlir-translate --no-implicit-module --serialize-spirv spv.mlir | spirv-dis
The code we obtain this way is pretty similar to the one above, modulo some extra decorations and capabilities added by the SPIR-V backend and the aforementioned differences.
OpCapability Addresses
OpCapability Kernel
OpMemoryModel Physical32 OpenCL
OpName %__spirv_BuiltInLocalInvocationId "__spirv_BuiltInLocalInvocationId"
OpName %get_local_id "get_local_id"
OpDecorate %__spirv_BuiltInLocalInvocationId BuiltIn LocalInvocationId
%uint = OpTypeInt 32 0
%v3uint = OpTypeVector %uint 3
%_ptr_Input_v3uint = OpTypePointer Input %v3uint
%__spirv_BuiltInLocalInvocationId = OpVariable %_ptr_Input_v3uint Input
%5 = OpTypeFunction %uint
%uint_0 = OpConstant %uint 0
%get_local_id = OpFunction %uint None %5
%7 = OpLabel
%9 = OpLoad %v3uint %__spirv_BuiltInLocalInvocationId
%10 = OpVectorExtractDynamic %uint %9 %uint_0
OpReturnValue %10
OpFunctionEnd
Conclusion
The above example shows that both lowerings are feasible.
We have pushed a draft PR with the current implementation of the dialect (see).
If the direction of this design of GEN finds agreement here, we will start the upstreaming through concrete PRs in the next step.
Thanks for the thorough walkthrough, @victor-eds! These lowering paths make sense to me.
I put the first PR for the GEN dialect up. Only the dialect and operation definitions, no conversions yet. PRs for conversion passes will follow.
Feel free to comment and thanks for your suggestions!
gen.local_id: query a work-item's local id
gen.work_group_id: query the id of a work-item's work-group
gen.work_group_size: query the size of a work-item's work-group
gen.num_work_groups: query the number of work-groups
gen.barrier: work-group barrier
gen.sub_group_shuffle: sub-group shuffle
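To give a feel for how these ops compose, here is a purely illustrative sketch; the assembly format is modeled on the gen.local_id example earlier in this thread, and the exact syntax in the PR may differ:

```mlir
// Hypothetical sketch only: op forms modeled on the gen.local_id example
// from this RFC; the actual assembly format is defined in the draft PR.
func.func @flat_global_id() -> index {
  %c0 = arith.constant 0 : i32
  %lid  = gen.local_id %c0        // work-item id within its work-group
  %wgid = gen.work_group_id %c0   // id of the enclosing work-group
  %wgsz = gen.work_group_size %c0 // work-group size along dimension 0
  // flat id = work_group_id * work_group_size + local_id
  %base = arith.muli %wgid, %wgsz : index
  %gid  = arith.addi %base, %lid : index
  gen.barrier                     // work-group barrier
  func.return %gid : index
}
```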
These specific ops are already present in the GPU dialect, I believe, including a SPIR-V lowering suitable for Intel GPUs. It's not clear to me why we need to duplicate the op definitions and lowerings in another place. It would be more productive to add an LLVM lowering, suitable for the LLVM SPIR-V backend/Khronos SPIR-V translator, for the existing GPU dialect ops.
The GPU dialect defines a thread_id op which takes an attribute, while the current PR for GEN defines local_id as taking an SSA value. It would be useful to understand why these exist and why they diverge from the GPU dialect indeed.
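For illustration, the divergence looks roughly like this (the gpu.thread_id syntax is from the upstream GPU dialect; the gen.local_id form follows the example earlier in the thread):

```mlir
// Upstream gpu dialect: the dimension is an attribute, fixed per op.
%tid = gpu.thread_id x

// GEN proposal (syntax from the example earlier in this thread): the
// dimension is an SSA value of type i32, so it could be dynamic.
%c0 = arith.constant 0 : i32
%lid = gen.local_id %c0
```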
The goal of GEN is to provide an MLIR LLVM target with access to Intel GPU instructions/intrinsics similar to NVVM and ROCDL. For MLIR developers who prefer to stay in the MLIR LLVM target flow or want to reuse their code targeting NVVM/ROCDL, supporting these operations directly in GEN reduces the complexity of bringing in a separate SPIRV target and the corresponding dependency. A skilled SPIRV compiler developer can use the GPU-to-SPIRV functions as you stated.
I think the suggestion was that if you use the ops from the gpu dialect, you can have custom lowering to llvm without plumbing through custom ops. Similarly, you probably don't plan to plumb through gen.addf and can use arith.addf instead. You could have an addGPUToGENLLVMPatterns for the llvm lowering path, and the spir-v path would be free. If you go with custom ops, you need both the ops and fundamentally the same patterns, but then you also need to write the SPIR-V conversion.
Would it make sense to limit GEN to only the ops that are specific to your hw / implementation? For example, I could see this making sense for subgroup shuffles where you actually know which variants are performant on Intel GPUs.
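As a rough sketch of the reuse being suggested, a GPU-to-LLVM pattern could map the upstream op straight to the builtin call from the walkthrough above, with no intermediate custom op (illustrative fragments only, not a complete module):

```mlir
// Input: plain upstream gpu dialect, no custom op needed.
%tid = gpu.thread_id x

// Hypothetical output of a GPU-to-LLVM pattern targeting the SPIR-V
// backend, mirroring the get_local_id lowering shown earlier in the thread
// (assuming a 64-bit spirv64 target):
llvm.func spir_funccc @_Z12get_local_idj(i32) -> i64

%c0 = llvm.mlir.constant(0 : i32) : i32
%tid64 = llvm.call @_Z12get_local_idj(%c0) : (i32) -> i64
```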
We are just adding a GPU built-ins lowering to GEN on the LLVM path to match what NVVM and ROCDL provide for Nvidia and AMD GPUs on the LLVM path. It benefits developers using the LLVM path, while people on the SPIRV path can continue to use the GPU-to-SPIRV built-in lowering. In our next set of PRs, we will add the Intel GPU-specific XMX instructions to support high-performance MMA. That part will benefit both SPIRV path and LLVM path developers.
Thanks for your feedback so far everyone!
The initial intention of this work was to enable people to target SPIR-V devices through the LLVM dialect and the LLVM SPIR-V backend. We believe there are valid reasons for users to use this route, e.g., to reuse existing lowerings to the LLVM dialect for other backends (as is the case for Triton) or to integrate with other frontends using LLVM IR directly (e.g., OpenCL), and there seems to be general interest in this route.
This should however not be a replacement of the SPIR-V dialect, just an alternative, so we re-designed GEN to also provide lowerings to SPIR-V when concerns about the parity with SPIR-V dialect were raised earlier in the discussion.
I agree that this leads to some duplication with the GPU-to-SPIRV lowering for the operations added to GEN in the first PR. This is probably an artifact of those operations being rather generic and not Intel GPU specific. We should probably distinguish these generic operations from more Intel GPU-specific operations that we may add in future steps.
For these generic operations, there are multiple options on how to handle their lowering (which does not yet exist) to LLVM dialect, targeting the LLVM SPIR-V backend.
- Keep them in the GEN dialect, provide lowerings to LLVM and SPIR-V, and accept the duplication.
- Keep them in the GEN dialect, but only provide lowerings to LLVM for these operations, as GPU-to-SPIRV can be reused for them. Future Intel GPU-specific operations added to GEN would still provide lowerings to LLVM and SPIR-V to keep feature parity with the SPIR-V dialect.
- Not put these operations in any dialect at all. Instead, provide a lowering from GPU to LLVM, targeting the LLVM SPIR-V backend, as a pass that creates LLVM intrinsic calls.
- Separate these generic operations out into a dialect akin to ROCDL/NVVM and provide lowerings from GPU to that dialect and from it to LLVM intrinsics, similar to what is done for ROCDL/NVVM. The name of such a dialect would still need to be decided. Future Intel GPU-specific operations could still be put into the GEN dialect with lowerings to LLVM and SPIR-V.
We would be interested to hear your opinions on which of those options you would prefer (or something else entirely). If we find consensus, we would be ready to implement and upstream such consensus.
From our point of view, this additional route to SPIR-V via LLVM would be a valuable addition to the MLIR ecosystem if we find consensus.
@mehdi_amini @kuhar @Hardcode84 We'd appreciate your opinion on what you think the best option is here, to guide the implementation.
Option (3) aligns with what I was suggesting: [RFC] Add GEN dialect for Intel GPUs - #56 by kuhar. My preference is to start with something sufficient to target your device and only add ops when it's required (either by the underlying ISA or ergonomics). You can always add ops later if you discover that going from common dialects like gpu/vector/arith to llvm intrinsics is insufficient.