Motivation
To support high-performance GEMM code generation on Intel GPUs, we propose the XeGPU dialect. The XeGPU dialect provides an abstraction that closely models Xe instructions. XeGPU ops are introduced where a special Xe instruction cannot be expressed by the LLVM or SPIR-V dialects, for example the matrix-multiply instruction (aka DPAS) and the 2D block load. The ops match the hardware instructions' semantics, including the matrix sizes. The XeGPU dialect is similar to the NVGPU and AMDGPU dialects and works as a bridge dialect providing target-specific operations on MLIR memref and vector data types.
The XeGPU dialect models a subset of the Xe GPU's unique features, focusing on GEMM performance. The operations include 2D block load, dpas, atomic, scattered load, 1D load, named barrier, mfence, and compile-hint. These operations provide a minimal set for high-performance MLIR GEMM implementations over a wide range of GEMM shapes. The XeGPU dialect complements the Arith, Math, Vector, and Memref dialects, which allows an XeGPU-based MLIR GEMM implementation to be fused with other operations lowered through existing MLIR dialects.
Example
Below is a short example of what the dialect looks like. It creates three tensor descriptors for matrices A, B, and C, followed by a K loop that iteratively loads a block from matrix A and a block from matrix B, performs the DPAS, and accumulates into a result vector. After the loop, the result vector is stored to a block of matrix C. The "vc" mode allows an XeGPU op to be lowered to a SPIR-V VC intrinsic with the "Intel Vector Compute" mode.
%4 = xegpu.create_nd_tdesc %arg2[%2, %3] {mode = vc} : memref<1024x1024xf32> -> !xegpu.tensor_desc<8x16xf32>
%5 = xegpu.load_nd %4 {mode = vc} : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%7 = xegpu.create_nd_tdesc %arg0[%2, %c0] {mode=vc}: memref<1024x1024xf16> -> !xegpu.tensor_desc<8x16xf16>
%8 = xegpu.create_nd_tdesc %arg1[%c0, %3] {mode=vc}: memref<1024x1024xf16> -> !xegpu.tensor_desc<16x16xf16>
%6:3 = scf.for %arg3 = %c0 to %c1024 step %c16 iter_args(%arg4 = %5, %subA = %7, %subB = %8) -> (vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>) {
  %9 = xegpu.load_nd %subA {mode=vc, vnni_axis = 1}: !xegpu.tensor_desc<8x16xf16> -> vector<8x8x2xf16>
  %10 = xegpu.load_nd %subB {mode=vc, vnni_axis = 0} : !xegpu.tensor_desc<16x16xf16> -> vector<8x16x2xf16>
  %11 = xegpu.dpas %9, %10, %arg4 {mode=vc}: vector<8x8x2xf16>, vector<8x16x2xf16>, vector<8x16xf32> -> vector<8x16xf32>
  %12 = xegpu.update_nd_offset %subA, [%c0, %c16] {mode=vc}: !xegpu.tensor_desc<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
  %13 = xegpu.update_nd_offset %subB, [%c16, %c0] {mode=vc}: !xegpu.tensor_desc<16x16xf16> -> !xegpu.tensor_desc<16x16xf16>
  scf.yield %11, %12, %13: vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>
}
xegpu.store_nd %6#0, %4 {mode = vc}: vector<8x16xf32>, !xegpu.tensor_desc<8x16xf32>
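The vnni_axis attribute in the loads above reshapes a 2D tile so that adjacent pairs of elements along the reduction (K) axis are packed together, which is how the DPAS unit consumes fp16 operands: the A tile goes from 8x16 to 8x8x2 (vnni_axis = 1) and the B tile from 16x16 to 8x16x2 (vnni_axis = 0). The NumPy sketch below illustrates this shape change and checks that a reference matmul-accumulate over the packed layouts matches a plain matmul; it is an illustration of the layout only, not the dialect's actual lowering, and the helper names (vnni_pack, dpas_ref) are hypothetical.

```python
import numpy as np

def vnni_pack(mat, vnni_axis):
    # Illustrative VNNI packing: split the K axis into pairs of
    # adjacent elements, mirroring xegpu.load_nd with vnni_axis set.
    if vnni_axis == 0:
        # B tile: (K, N) -> (K/2, 2, N) -> (K/2, N, 2)
        k, n = mat.shape
        return mat.reshape(k // 2, 2, n).transpose(0, 2, 1)
    else:
        # A tile: (M, K) -> (M, K/2, 2)
        m, k = mat.shape
        return mat.reshape(m, k // 2, 2)

def dpas_ref(a_pack, b_pack, acc):
    # Reference semantics of the dpas op: unpack both operands
    # back to 2D and do a plain matmul-accumulate in f32.
    m, kh, _ = a_pack.shape
    a = a_pack.reshape(m, kh * 2)
    b = b_pack.transpose(0, 2, 1).reshape(kh * 2, -1)
    return a.astype(np.float32) @ b.astype(np.float32) + acc

a = np.arange(8 * 16, dtype=np.float16).reshape(8, 16)    # A tile
b = np.arange(16 * 16, dtype=np.float16).reshape(16, 16)  # B tile
acc = np.zeros((8, 16), dtype=np.float32)                 # C tile

print(vnni_pack(a, 1).shape)  # matches vector<8x8x2xf16>
print(vnni_pack(b, 0).shape)  # matches vector<8x16x2xf16>
print(dpas_ref(vnni_pack(a, 1), vnni_pack(b, 0), acc).shape)
```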
Reference
XeGPU has been implemented in the Intel Extension for MLIR GitHub repo. The high-performance XeGPU-based GEMM implementation can be found here, and the test case demonstrates close-to-peak GEMM performance on the Intel Max series.
See XeGPU Op definition for details.