[RFC] Add XeGPU dialect for Intel GPUs

Motivation

To support high-performance GEMM code generation on Intel GPUs, we propose the XeGPU dialect. The XeGPU dialect provides an abstraction that closely models Xe instructions. XeGPU ops are introduced when a special Xe instruction can’t be expressed by the LLVM/SPIR-V dialects, for example the matrix instruction (a.k.a. DPAS) and the 2D block load. It matches the hardware instructions’ semantics, including the matrix sizes. The XeGPU dialect is similar to the NVGPU and AMDGPU dialects and works as a bridge dialect providing target-specific operations on MLIR memref and vector data types.

The XeGPU dialect models a subset of the Xe GPU’s unique features, focusing on GEMM performance. The operations include 2d load, dpas, atomic, scattered load, 1d load, named barrier, mfence, and compile-hint. These operations provide a minimum set to support a high-performance MLIR GEMM implementation for a wide range of GEMM shapes. The XeGPU dialect complements the Arith, Math, Vector, and Memref dialects. This allows an XeGPU-based MLIR GEMM implementation to be fused with other operations lowered through existing MLIR dialects.

Example
Below is a short example of what it looks like. It creates three tensor descriptors for matrices A, B, and C, followed by a K loop that iteratively loads a block from matrix A and a block from B, does the DPAS, and accumulates into a result vector. After the loop, the result vector is stored to a block of matrix C. The “vc” mode allows the XeGPU op to be lowered to SPIR-V VC intrinsics with the “Intel Vector Compute” mode.

%4 = xegpu.create_nd_tdesc %arg2[%2, %3] {mode = vc} : memref<1024x1024xf32> -> !xegpu.tensor_desc<8x16xf32>
%5 = xegpu.load_nd %4 {mode = vc} : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%7 = xegpu.create_nd_tdesc %arg0[%2, %c0] {mode = vc} : memref<1024x1024xf16> -> !xegpu.tensor_desc<8x16xf16>
%8 = xegpu.create_nd_tdesc %arg1[%c0, %3] {mode = vc} : memref<1024x1024xf16> -> !xegpu.tensor_desc<16x16xf16>
%6:3 = scf.for %arg3 = %c0 to %c1024 step %c16
    iter_args(%arg4 = %5, %subA = %7, %subB = %8)
    -> (vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>) {
  %9 = xegpu.load_nd %subA {mode = vc, vnni_axis = 1} : !xegpu.tensor_desc<8x16xf16> -> vector<8x8x2xf16>
  %10 = xegpu.load_nd %subB {mode = vc, vnni_axis = 0} : !xegpu.tensor_desc<16x16xf16> -> vector<8x16x2xf16>
  %11 = xegpu.dpas %9, %10, %arg4 {mode = vc} : vector<8x8x2xf16>, vector<8x16x2xf16>, vector<8x16xf32> -> vector<8x16xf32>
  %12 = xegpu.update_nd_offset %subA, [%c0, %c16] {mode = vc} : !xegpu.tensor_desc<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
  %13 = xegpu.update_nd_offset %subB, [%c16, %c0] {mode = vc} : !xegpu.tensor_desc<16x16xf16> -> !xegpu.tensor_desc<16x16xf16>
  scf.yield %11, %12, %13 : vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>
}
xegpu.store_nd %6#0, %4 {mode = vc} : vector<8x16xf32>, !xegpu.tensor_desc<8x16xf32>

Reference
XeGPU has been implemented in the Intel Extension for MLIR GitHub repo. The high-performance XeGPU-based GEMM implementation can be found here, and the test case demonstrates close-to-peak GEMM performance on the Intel Max Series.

See XeGPU Op definition for details.


I’m not familiar with Xe: is there a set of intrinsics in LLVM like for NVVM and AMDGPU? The lowering path isn’t clear to me from your description.

From the RFC on Intel’s MLIR Extensions repo

Proposal

The XeGPU dialect models a subset of the Xe GPU’s ISA. This is the counterpart of the NVGPU and AMDGPU dialects, which provide bridge dialects in MLIR’s gradual lowering. The XeGPU dialect works with the MLIR memref and vector types and complements the Arith, Math, Vector, and Memref dialects. XeGPU operations are introduced when there is a special Xe instruction not modeled by the LLVM/SPIR-V dialects, for example DPAS and 2D block load. In some cases, one XeGPU op may lower to a sequence of instructions for a dedicated, performance-critical function. For example, create_tdesc is mapped to a fixed sequence of instructions to create an address description.

Notes

Currently, there is no lower-level dialect for the Intel GPU compiler toolchain to represent GPU ops with values based on LLVM data types, such as the NVVM dialect for the Nvidia GPU compiler toolchain. The XeGPU dialect uses LLVM or SPIR-V intrinsics to access advanced Intel GPU instructions. When the lower-level software changes, we expect the XeGPU lowering passes to change accordingly.

Thanks, I am still not sure about:

for the Nvidia GPU compiler toolchain. The XeGPU dialect uses LLVM or SPIR-V intrinsics to access advanced Intel GPU instructions.

Does this mean that LLVM already has intrinsics for XeGPU?
(I am trying to picture what is this dialect lowered to upstream.)


Can you elaborate? How would this dialect compose with other upstream dialects? In particular:

Presumably you’d like things like linalg.matmul to be lowered to XeGPU? What’s the roadmap for that? And would it be possible to have end-to-end tests upstream?

-Andrzej

Upstream LLVM doesn’t have intrinsics for the Xe GPU yet.

The XeGPU op will first be lowered to the LLVM dialect with an external function call to an Intel-specific function name, then lowered to LLVM bitcode and translated to a SPIR-V binary. These external function names are recognized by Intel’s low-level SW stack (IGC) as intrinsics.
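To make the shape of that lowering concrete, below is a minimal sketch of the lowered form. The callee name @intel.xe.dpas.placeholder and its signature are hypothetical placeholders (the real intrinsic names are defined by IGC); operands are flattened 1-D vectors because the LLVM dialect has no multi-dimensional vectors.

// Hypothetical sketch only: the callee name and signature are placeholders,
// not the actual IGC intrinsic interface.
llvm.func @intel.xe.dpas.placeholder(vector<128xf16>, vector<128xf16>, vector<128xf32>) -> vector<128xf32>

llvm.func @lowered_dpas(%a: vector<128xf16>, %b: vector<128xf16>, %c: vector<128xf32>) -> vector<128xf32> {
  // xegpu.dpas becomes a plain external call in the LLVM dialect; IGC
  // recognizes the (placeholder) callee name as an intrinsic and emits DPAS.
  %r = llvm.call @intel.xe.dpas.placeholder(%a, %b, %c) : (vector<128xf16>, vector<128xf16>, vector<128xf32>) -> vector<128xf32>
  llvm.return %r : vector<128xf32>
}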

The current implementation in the Intel Extension for MLIR GitHub repo is lowered through the SPIR-V dialect, which generates SPIR-V IR directly. But when we upstream, we plan to upstream the LLVM dialect lowering path.

XeGPU ops interact with the memref and vector data types. Once the tensor address description is set up over a memref, a 2d block can be loaded from the memref into a vector. With the data loaded into a vector, it can be processed by any other dialect accepting vector types.
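As a small sketch of that composition, reusing the syntax from the example above (%arg0, %x, and %y are assumed to be a memref and index values in scope), a loaded block can feed directly into an ordinary arith op:

%td = xegpu.create_nd_tdesc %arg0[%x, %y] {mode = vc} : memref<1024x1024xf32> -> !xegpu.tensor_desc<8x16xf32>
%v = xegpu.load_nd %td {mode = vc} : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
// Any dialect accepting vector types can process the loaded block.
%s = arith.addf %v, %v : vector<8x16xf32>
xegpu.store_nd %s, %td {mode = vc} : vector<8x16xf32>, !xegpu.tensor_desc<8x16xf32>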

linalg.matmul would be able to be lowered to XeGPU. The lowering could be gradual: it would first be lowered to larger 2d submatrices, and then blocked to the hardware 2d block size at the XeGPU level. Internally we are experimenting with this gradual lowering, and eventually we would like to upstream the dialects/passes that come out of the experiment.

We have an end-to-end XeGPU-based GEMM implementation for 4Kx4K here, and we can upstream that test case to llvm-project/mlir/test/Integration/Dialect as part of the XeGPU lowering pass.

Is there a corresponding proposal to add an Intel GPU backend to LLVM?

Apologies, I should’ve been clearer. I meant interacting with higher-level dialects that would target XeGPU. More specifically: do you envisage there being any upstream “clients” (e.g. higher-level dialects) targeting the XeGPU dialect? Otherwise we end up with something that can’t really be used in practice.

Are you able to share the roadmap for that?

-Andrzej


Hi Jon, this is a great question! Unfortunately, at this stage there is no such corresponding proposal.

Yes, we are working on lowering from Linalg to XeGPU. I was trying to explain that the lowering is gradual and takes a middle step, so it will potentially evolve additional dialects/ops/passes for upstreaming. I can’t share the exact roadmap, but we will be happy to post our progress on the MLIR forums.

An example of that strategy is our research project, plaidml/tpp-mlir (TPP experimentation on MLIR for linear algebra).

For now, we lower to CPUs via libxsmm (a library for specialized dense and sparse matrix operations, and deep learning primitives), but we are working to start lowering to XeGPU (via intel/mlir-extensions, Intel® Extension for MLIR: a staging ground for MLIR dialects and tools for Intel devices).

As soon as it becomes an upstream dialect, we can drop the IMEX dependency and use MLIR directly.


Thanks for the proposal. I’m generally supportive of this as it aligns well with what we have for other targets, including GPU and CPU. However, I also agree with the sentiment that we should have an end-to-end story upstream for this to be useful for the community and to justify the maintenance effort. Would it be possible for you to prioritize the LLVM support first?


Thanks for your comments. At this point, the portable interface to the Intel GPU driver is SPIR-V, and there is a SPIR-V backend in LLVM. Instead of using the LLVM-to-SPIR-V translator, we could use the LLVM SPIR-V backend to generate the SPIR-V binary.

If you are referring to an LLVM backend for the Intel GPU ISA directly, I agree that ideally we should have an LLVM backend for the Intel GPU; however, it is unlikely for us to get there anytime soon. At this stage, we will have to rely on the LLVM/SPIR-V route.

Would it be feasible to go to the SPIR-V MLIR dialect directly and skip LLVM entirely?


That’s a technically valid approach. We started with the SPIR-V dialect lowering as a prototype, but we prefer the LLVM dialect path so that we can have both options: 1) lowering to a SPIR-V binary, and 2) the possibility of evolving to an end-to-end LLVM lowering stack like other HW targets.

The program input to the low-level software (e.g. runtime and driver) is a SPIR-V binary for Intel GPUs. The lowering path for the XeGPU dialect is through the GEN dialect (a link) and LLVM IR/bitcode to a SPIR-V binary. The XeGPU ops are eventually lowered to SPIR-V external function calls to Intel-specific intrinsics. The GEN dialect mirrors the NVVM/ROCDL dialects to support MLIR’s gradual lowering. The GEN dialect is a leaf-level LLVM dialect that intermixes with LLVM data types and exposes the Xe ISA at the LLVM-IR level. XeGPU sits at a higher level and intermixes with the MLIR vector and memref data types.

Intel GPU’s low-level SW stack is still evolving. Although there is no LLVM backend for the Xe ISA, the XeGPU and GEN dialects can support upper-level dialects like Linalg lowering to XeGPU. Underneath the XeGPU and GEN dialects, our implementation can choose the lowering path as the low-level software stack evolves, say lowering through a SPIR-V binary or a direct LLVM backend for the Xe ISA.

Thanks for elaborating. Since the MPI dialect proposal was just updated, which was a good reminder of the process and criteria, can you try to frame the RFC here to fit what we’re looking for?
I think a lot of information was provided sparsely across the thread, and it’s worth consolidating it into a coherent description now.

Thanks!


[Question] What is the overall goal of the dialect?

[Answer] The XeGPU dialect aims to support high-performance GEMM code generation on Intel Xe GPUs. It provides an abstraction that closely models Xe instructions. XeGPU ops are introduced when a special Xe instruction can’t be expressed by the LLVM/SPIR-V dialects, such as the matrix instruction (a.k.a. DPAS) and the 2D block load.

[Question] What is the first implementation milestone?

[Answer] The first implementation milestone is to provide the initial XeGPU dialect implementation and a high-performance GEMM code example on the Intel Data Center GPU Max Series.

[Question] How does it fit into the MLIR dialect ecosystem?

[Answer] The XeGPU dialect’s position is similar to the NVGPU and AMDGPU dialects in the MLIR dialect ecosystem. It works as a bridge dialect providing target-specific operations on MLIR memref and vector data types.

[Question] Connection: how does it connect to the existing dialects in a compilation pipeline(s)?

[Answer] The XeGPU dialect complements the Arith, Math, Vector, and Memref dialects. XeGPU ops interact with the memref and vector data types. Once the tensor address description is set up over a memref, a 2d block can be loaded from the memref into a vector. With the data loaded into a vector, it can be processed by any other dialect accepting vector types. This allows an XeGPU-based MLIR GEMM implementation to be fused with other dialects’ operations and lowered through existing MLIR dialects.

High-level ops like linalg.matmul can be lowered to XeGPU. The lowering would be gradual: linalg.matmul is first lowered to a nested SCF loop in which vector.transfer_read reads a subview of the tensor (a larger 2d submatrix) into a vector, followed by vector operations; these are then further lowered and blocked into multiple XeGPU operations on smaller 2d submatrices matching the hardware 2d block size.
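A minimal sketch of that middle step, assuming an illustrative 128x128 tile size and hypothetical function/value names, not the actual pass output:

// One tile of the intermediate form: a larger 2d submatrix is read into a
// vector; a later blocking pass splits it into 8x16 xegpu.load_nd/dpas ops.
func.func @read_tile(%A: memref<1024x1024xf32>, %i: index, %j: index) -> vector<128x128xf32> {
  %c0 = arith.constant 0 : index
  %pad = arith.constant 0.0 : f32
  %sub = memref.subview %A[%i, %j] [128, 128] [1, 1] : memref<1024x1024xf32> to memref<128x128xf32, strided<[1024, 1], offset: ?>>
  %v = vector.transfer_read %sub[%c0, %c0], %pad {in_bounds = [false, false]} : memref<128x128xf32, strided<[1024, 1], offset: ?>>, vector<128x128xf32>
  return %v : vector<128x128xf32>
}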

The lowering path for the XeGPU dialect is through the GEN dialect (a link) and LLVM IR/bitcode to a SPIR-V binary. The XeGPU ops are eventually lowered to SPIR-V external function calls to Intel-specific intrinsics. The GEN dialect mirrors the NVVM/ROCDL dialects to support MLIR’s gradual lowering. The GEN dialect is a leaf-level LLVM dialect that intermixes with LLVM data types and exposes the Xe ISA at the LLVM-IR level. XeGPU sits at a higher level and intermixes with the MLIR vector and memref data types.

The other potential lowering is through the SPIR-V dialect instead of the GEN dialect. We prefer the GEN dialect / LLVM IR path so that we can have both options: 1) lowering to a SPIR-V binary, and 2) the possibility of evolving to an end-to-end LLVM lowering stack like other HW targets.

[Question] Consolidation: is there already a dialect with a similar goal or matching abstractions; if so, can it be improved instead of adding a new one?

[Answer] There is no other dialect in MLIR fitting this goal.

[Question] Reuse: how does it generalize to similar but slightly different use cases?

[Answer] We have put considerable effort into designing the XeGPU dialect so that it can also cover other Xe ISA variants. Below is a list of design considerations to make it general:

  1. XeGPU ops accept different matrix sizes, and dialect validation ensures the input sizes match the hardware matrix sizes for a particular uISA.
  2. The same set of XeGPU matrix ops supports lowering to either SIMT or SIMD intrinsics. By using a mapping attribute specifying the relation between work items and data elements, we avoid introducing two different sets of operations at the XeGPU dialect level.
  3. The load_nd op covers loading a 1d vector, a 2d matrix, and potentially an nd tensor (see the sketch below). This design makes the op definition extensible for potential hardware enhancements.
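As a sketch of point 3, below is a hypothetical 1d use of the same op pair; only the 2d form appears in the example above, so the shapes and exact syntax here are assumptions:

// Hypothetical 1d form; shapes and syntax are illustrative only.
%d1 = xegpu.create_nd_tdesc %vec[%i] {mode = vc} : memref<4096xf16> -> !xegpu.tensor_desc<16xf16>
%v1 = xegpu.load_nd %d1 {mode = vc} : !xegpu.tensor_desc<16xf16> -> vector<16xf16>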

[Question] What is the community of users that it is serving?

[Answer] The XeGPU dialect serves the community who aspire to build high-performance GEMM capability using MLIR infrastructure on Intel GPUs.

[Question] Who are the future contributors/maintainers beyond those who propose the dialect?

[Answer] Intel engineers would mainly maintain it. Chao Chen @chencha3 at Intel has already submitted the initial PR with the dialect definition and test cases. Sang Ik Lee @silee2 and other Intel engineers will further contribute lowering passes, code examples, and future maintenance.


It seems to me that the MLIR-friendly approach is to generate SPIR-V from MLIR: LLVM is just a detour here (actually it is a shortcut in development effort, I guess) and wouldn’t provide significant value.
The added value of targeting the MLIR SPIR-V dialect seems to be one of the reasons to take this dialect in-tree: it would help make the SPIR-V path more prominent and well supported (and tested) in MLIR.
This leads directly into this question:

The connection to the SPIR-V dialect should be a natural answer for fitting into MLIR; it’s surprisingly absent here though.
