[RFC] Add GEN dialect for Intel GPUs

Motivation

The MLIR ecosystem contains the NVVM dialect, which extends the LLVM dialect to provide operations useful for programming NVIDIA GPUs. Similarly, the ROCDL dialect extends the LLVM dialect and provides operations for programming AMD GPUs.

In order to provide similar functionality for programming Intel GPUs, we propose the addition of a new LLVM target dialect (GEN) to act as a counterpart to the NVVM and ROCDL dialects. The GEN dialect will provide operations that expose selected Xe ISA (codename GEN) assembly instructions to the MLIR ecosystem. Hierarchically, the GEN dialect sits below the XeGPU dialect, and it is our intention to support lowering from the latter to the former where it makes sense.

Initially the GEN dialect will contain operations to:

  • query GPU properties such as thread IDs, block IDs, block and grid dimensions, etc.
  • emit barrier and group shuffle operations
  • emit instructions useful for accessing systolic array hardware for matrix operations
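As a rough sketch of what these operations might look like in MLIR assembly (the op names below are hypothetical placeholders; only `genx.barrier` appears in the implementation branch shown later in this RFC):

```mlir
// Query work-item properties (hypothetical op names).
%tid  = genx.workitem.id.x : i32    // thread ID within the workgroup
%wgid = genx.workgroup.id.x : i32   // block (workgroup) ID
%dim  = genx.workgroup.dim.x : i32  // block dimensions

// Synchronize all work-items in the workgroup.
genx.barrier
```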

Example

The GEN dialect has been implemented in a branch in the Intel LLVM monorepo.
Below is an example of what an operation in the GEN dialect looks like:

def GENX_BarrierOp : GENX_Op<"barrier"> {
  let summary = "Workgroup barrier";

  string baseDescription = [{
    The `genx.barrier` operation performs a workgroup barrier and ensures all outstanding
    memory transactions using local or global memory are complete.
  }];

  string llvmBuilder = [{
    llvm::Type *retType = builder.getVoidTy();
    llvm::Type *argType = builder.getInt32Ty();
    llvm::Value *arg = llvm::ConstantInt::get(argType, 3 /*memfence*/);
    createDeviceFunctionCall(builder, "_Z7barrierj", retType, {argType}, {arg});
  }];

  let assemblyFormat = "attr-dict";
}
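Given the `assemblyFormat` above, using the op in the IR is a single keyword. A minimal, hypothetical kernel fragment (the surrounding function is illustrative only, mirroring how NVVM ops are used inside `llvm.func`) might look like:

```mlir
// Sketch: synchronize the workgroup between writes to and reads from
// workgroup-local memory.
llvm.func @kernel(%buf : !llvm.ptr) {
  // ... stores to local memory ...
  genx.barrier
  // ... loads that must observe the stores above ...
  llvm.return
}
```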

This new dialect fits well with the ROCDL and NVVM dialects. It’s good to see MLIR expanding its capabilities in the Intel GPU direction.

I’m not familiar with the compilation flow. Could you explain what this dialect lowers down to? For example, the NVVM dialect compiles to LLVM’s NVPTX backend, allowing us to test the generated LLVM IR (even PTX) and run end-to-end tests. Adding this info into the RFC would be helpful.


Do you mind amending the proposal in the same way as the recent dialect proposals? See [RFC] Add XeGPU dialect for Intel GPUs - #18 by mehdi_amini

Thanks!


It’s great to see this proposal. It would be important to describe though what the path to execution is from this dialect.

Note that when the LLVM/NVVM dialects were added to MLIR, there was a path to lower those ops to LLVM IR (since the latter already had the NVVM intrinsics to map to) and compile to executable code.


Currently, the Intel GPU backend accepts SPIR-V as input. The GEN dialect is an LLVM target dialect, therefore we can use a translator (https://github.com/KhronosGroup/SPIRV-LLVM-Trans) to convert the generated LLVM IR to SPIR-V. Another potential option is to use the SPIR-V backend in LLVM.
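As a sketch of the translator path described above (the tool names are the standard LLVM and Khronos binaries; the input file name is hypothetical):

```shell
# Translate the LLVM-dialect MLIR module to LLVM IR.
mlir-translate --mlir-to-llvmir kernel.mlir -o kernel.ll

# Assemble to bitcode, then produce a SPIR-V binary with the
# Khronos SPIRV-LLVM translator.
llvm-as kernel.ll -o kernel.bc
llvm-spirv kernel.bc -o kernel.spv
```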


Hi, I think this reply [RFC] Add GEN dialect for Intel GPUs - #5 by etiotto addresses your question.

OK, I will take a look at the new format and update the RFC.

Are both options functionally equivalent? If not, are there differences w.r.t coverage or performance? When you refer to SPIR-V above, is it the IR or the binary format? (The link is broken - missing trailing text ‘lator’.)

We have experimented with the Khronos LLVM-SPIRV translator and it works well. We haven’t experimented with the SPIR-V backend, although in principle it should work. When I referred to SPIR-V in this context, I meant the binary format.

Sorry about the broken link, the correct link is to the Khronos public LLVM-SPIRV translator: KhronosGroup/SPIRV-LLVM-Translator: A tool and a library for bi-directional translation between SPIR-V and LLVM IR (github.com)

[Question] What is the overall goal of the dialect?

[Answer] The GEN dialect is aimed at extending the LLVM dialect with operations that expose selected Intel Xe ISA (codename GEN) assembly instructions to the MLIR ecosystem, such as the instruction used to do matrix multiplication with accumulation (i.e., DPAS).

[Question] What is the first implementation milestone?

[Answer] The first implementation milestone aims at providing operations to:

  • query GPU properties such as thread IDs, block IDs, block and grid dimensions, etc.
  • emit barrier and group shuffle operations
  • emit instructions useful for accessing systolic array hardware for matrix operations

[Question] How does it fit into the MLIR dialect ecosystem?

[Answer] The MLIR ecosystem contains the NVVM dialect, which extends the LLVM dialect to provide operations useful for programming NVIDIA GPUs, and the ROCDL dialect, which extends the LLVM dialect in a similar way and provides operations for programming AMD GPUs. The GEN dialect is positioned at the same hierarchical level as the NVVM and ROCDL dialects and provides similar functionality for programming Intel GPUs.

[Question] Connection: how does it connect to the existing dialects in a compilation pipeline(s)?

[Answer] The GEN dialect complements the LLVM dialect to provide the ability to generate instructions for accessing Intel GPUs (just like NVVM/ROCDL dialects extend the LLVM dialect to provide access to instructions for NVIDIA/AMD GPUs).

Higher-level dialects, such as the GPU dialect, can lower certain operations to the GEN dialect. For example, the GPU dialect can convert gpu::ThreadIdOp to GEN::ThreadIdOp.
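As a hedged sketch of such a conversion (only `gpu.thread_id` is an existing upstream op; the GEN op name and the index cast are assumed, mirroring how `gpu.thread_id` lowers to `nvvm.read.ptx.sreg.tid.x` for NVIDIA targets):

```mlir
// Before: GPU dialect.
%tid = gpu.thread_id x

// After (hypothetical lowering): a GEN dialect query op, cast back
// to the index type that gpu.thread_id produces.
%0   = genx.workitem.id.x : i32
%tid = arith.index_cast %0 : i32 to index
```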

Also, we envision that the XeGPU dialect (link) will implement a lowering path to the GEN dialect.

[Question] Consolidation: is there already a dialect with a similar goal or matching abstractions; if so, can it be improved instead of adding a new one?

[Answer] There is no other dialect in MLIR fitting this goal.

[Question] Reuse: how does it generalize to similar but slightly different use cases?

[Answer] Over time the GEN dialect can be expanded to include more operations exposing Xe ISA instructions.

[Question] What is the community of users that it is serving?

[Answer] The GEN dialect serves open-source communities that aspire to build high-performance applications on Intel GPUs using the MLIR infrastructure.

[Question] Who are the future contributors/maintainers beyond those who propose the dialect?

[Answer] Intel engineers would mainly maintain it, for example, whitneywhtsang (Whitney Tsang) (github.com), pengtu (Peng Tu) (github.com), and others.


Thanks. So, it looks like if these dialects are in the MLIR repo, it would be feasible to set up execution tests for the Intel GPU if the system has the SPIRV-LLVM-Translator. AFAICS, JIT execution (via mlir-cpu-runner) may not be possible (if the translator is used to go from LLVM to SPIRV), but an AOT test could be set up and execution verified. Is that correct?

Are there similar tests for NV/AMD dialects to mimic from?

There are numerous NVIDIA and AMD GPU execution tests – they are JITted and executed via the ORC JIT seamlessly. However, that approach won’t work if using the external SPIRV-LLVM translator, as I mentioned above.

I meant AOT ones.

Quick PSA, because I think the change was introduced without much publicity:

We already have some form of GPU compilation through SPIR-V available on mlir-cpu-runner. See this test:

However, as with NVIDIA and AMD, external tooling is required to fully execute them. I can confirm they work on ALCF Aurora’s Intel GPUs.

They work by serializing the GPU SPIR-V module, and then at runtime compiling it.

CCing the author: @silee2


Is this using the LLVM SPIRV backend? That’s not the path we were discussing - see [RFC] Add GEN dialect for Intel GPUs - #9 by etiotto If a binary is created using an LLVM backend, it would fit the mlir-cpu-runner ORC JIT approach.

No. It’s going directly to SPIR-V.

However, from what I can tell from the RFC (maybe @etiotto can correct me), the idea is always ending up in a serialized SPIR-V binary.

Hence, in all cases we would be able to execute tests with mlir-cpu-runner. Because we already have the capacity of executing SPIR-V binaries.

In the same way that ORC-JIT doesn’t generate the final CUDA binary, ORC-JIT doesn’t have to generate executable code for Intel GPUs; that’s the job of Intel’s SPIR-V backend.


I’m in general pro this as it seems to form a useful component at the same level of backend specialization as existing ones.

Re testing: it is currently true that folks have to download separate libraries to actually run on HW. As long as the testing is sufficient to ensure things keep working and would not impede other upstream development (e.g., action-at-a-distance effects causing rollbacks), I’d be happy to see this land, with room for further growth. Of course, we should remain mindful of duplicated functionality and aligned goals, to keep duplication down.


That is correct.


We plan to test that the GEN dialect operations are lowered correctly by using lit tests.