[RFC] Clang frontend changes for OpenCL C Cooperative Matrix Extension

  1. Overview

This is a first step towards adding support to the OpenCL C language for a new type known as “cooperative matrix”, along with new built-in operations, as proposed by the Cooperative Matrix Extension from the Khronos OpenCL work group. The new type and its built-in operations allow matrices and optimized matrix multiply operations to be represented in OpenCL.

  2. Background

A “cooperative matrix” type was first introduced in the VK_KHR_cooperative_matrix extension for Vulkan (VK_KHR_cooperative_matrix(3)). The extension added support for “cooperative matrix” types in SPIR-V, where they are primarily supported in compute shaders. Such types represent matrices whose storage is spread across all invocations in some scope (usually a subgroup); those invocations cooperate to perform matrix multiplies efficiently.

For the sake of brevity, we refer to the Khronos Group presentation “Cooperative Matrix Multiply”, which explains the extension in detail.

An initial version of the OpenCL extension for cooperative matrices can be found here. Changes proposed in this RFC would add to the initial extension by enabling OpenCL C language support for cooperative matrices.

Also, we would like to acknowledge Imagination Technologies® for their contributions to this document.

  3. History of cooperative matrix in LLVM project

SPV_KHR_cooperative_matrix is a SPIR-V extension introduced by Khronos, and it is supported in the official LLVM SPIR-V backend. Our proposal aims to generate LLVM IR that the SPIR-V backend can consume without any further modifications.

  4. Proposed Approach

We propose changes to the clang OpenCL front-end to represent (and lower to LLVM IR) a new “cooperative matrix” type that can be defined in a kernel source along with built-in operations proposed by our extension. The following sections describe our proposed changes in detail.

4.1. Cooperative Matrix Type

A “cooperative matrix” type can be defined in a kernel source as:

<component_type> <var_name> __attribute__((coop_mat(_scope_, _row_, _column_, _use_)));

 Example 1: float vmat1 __attribute__((coop_mat(CLK_COOPERATIVE_MATRIX_SCOPE_SUBGROUP, 8, 8, CLK_COOPERATIVE_MATRIX_A)));
 Example 2: typedef half chmat44 __attribute__((coop_mat(CLK_COOPERATIVE_MATRIX_SCOPE_SUBGROUP, 4, 4, CLK_COOPERATIVE_MATRIX_B)));
chmat44 vmat2;

Where __attribute__((coop_mat(_scope_, _row_, _column_, _use_))) is our proposed custom type attribute for declaring matrix types, with any trivial data type already available in the OpenCL C language as the base/component type. In the above examples, the base/component type is a scalar numerical type (float/half). The _scope_ parameter is one of the supported memory scopes; the matrix is spread across all the invocations in this memory scope. Currently the only supported _scope_ value is:

CLK_COOPERATIVE_MATRIX_SCOPE_SUBGROUP

Our implementation can be extended to add more memory scopes in the future as needed.
The _use_ parameter defines the use of the matrix variable and is one of the following predefined enums (introduced by the extension):

 CLK_COOPERATIVE_MATRIX_A
 CLK_COOPERATIVE_MATRIX_B
 CLK_COOPERATIVE_MATRIX_ACCUMULATOR

4.2. Lowering to Clang AST

This custom attribute is then parsed and lowered to the Clang AST through a class called clang::CooperativeMatrixType, derived from the upstream clang::MatrixType. We used the clang::ConstantMatrixType class as a reference and extended it to store the _scope_ and _use_ information.

The extension introduces the __opencl_c_ext_cooperative_matrix feature name to guard the cooperative matrix type generation.

4.3. Lowering to LLVM IR

This cooperative matrix type is then lowered to LLVM IR using the Target Extension Type target("spirv.CooperativeMatrixKHR", <component_type>, <rows>, <columns>, <scope>, <use>). We chose this representation largely so that a cooperative matrix type in the IR is target agnostic and can be lowered further for a particular target architecture. The name of this target extension type was also chosen to match the existing cooperative matrix implementation in the SPIR-V backend. For example, the following defines a cooperative matrix called m, followed by the LLVM IR representation of its data type.

 typedef half matA __attribute__((coop_mat(CLK_COOPERATIVE_MATRIX_SCOPE_SUBGROUP, 2, 4, CLK_COOPERATIVE_MATRIX_A)));
 matA m;
 target("spirv.CooperativeMatrixKHR", half, 2, 4, 0, 0) // LLVM IR representation of data type for m
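
As a rough illustration of this mapping, the sketch below builds the textual type from the attribute parameters. The parameter order and the numeric encodings (subgroup scope as 0; A, B, ACCUMULATOR as 0, 1, 2) are assumptions inferred from the IR examples in this RFC, and every name in the code is hypothetical rather than part of the prototype.

```cpp
#include <string>

// Assumed encodings, inferred from the IR examples in this RFC.
// These names are hypothetical and not taken from the prototype.
enum class CoopMatUse { A = 0, B = 1, Accumulator = 2 };
enum class CoopMatScope { Subgroup = 0 };

// Build the textual LLVM IR type for a cooperative matrix, in the order
// component type, rows, columns, scope, use (as the examples print it).
std::string coopMatIRType(const std::string &elemTy, unsigned rows,
                          unsigned cols, CoopMatScope scope, CoopMatUse use) {
  return "target(\"spirv.CooperativeMatrixKHR\", " + elemTy + ", " +
         std::to_string(rows) + ", " + std::to_string(cols) + ", " +
         std::to_string(static_cast<int>(scope)) + ", " +
         std::to_string(static_cast<int>(use)) + ")";
}
```

Calling coopMatIRType("half", 2, 4, CoopMatScope::Subgroup, CoopMatUse::A) reproduces the type string shown for m above.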

4.4. Built-in Functions

The following Built-in Functions are introduced by the extension:

 <coop_mat_type> coop_mat_load(const gentype *p, const coop_matrix_layout_t layout, const size_t stride) // Load a cooperative matrix from p.

 void coop_mat_store(const gentype *p, <coop_mat_type> m, const coop_matrix_layout_t layout, const size_t stride) // Store a cooperative matrix to p.

 <coop_mat_type> coop_matmul_add(<coop_mat_type> A, <coop_mat_type> B, <coop_mat_type> C, <coop_mat_operands> coop_mat_operand) // Matrix multiply of A by B and then component-wise add C.

As described in sections 4.2 and 4.3, every <coop_mat_type> is represented using CooperativeMatrixType and then lowered to the Target Extension Type in LLVM IR.

The layout argument indicates the layout of the matrix values in memory. It can accept one of the following predefined Enum values as introduced by the extension:

 CLK_COOPERATIVE_MATRIX_LAYOUT_ROW_MAJOR
 CLK_COOPERATIVE_MATRIX_LAYOUT_COLUMN_MAJOR
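
To make the interaction of layout and stride concrete, here is a minimal host-side sketch of the address computation. It assumes that stride counts elements between the starts of consecutive rows (row-major) or columns (column-major), as in SPV_KHR_cooperative_matrix; the RFC text does not define stride, and the names below are hypothetical.

```cpp
#include <cstddef>

// Hypothetical stand-in for the two layout enum values of the extension.
enum class Layout { RowMajor, ColumnMajor };

// Offset, in elements from the base pointer p, of component (row, col).
// Assumption: stride is the element distance between consecutive rows
// (row-major) or consecutive columns (column-major).
constexpr std::size_t elementOffset(Layout layout, std::size_t row,
                                    std::size_t col, std::size_t stride) {
  return layout == Layout::RowMajor ? row * stride + col
                                    : col * stride + row;
}
```

Under this assumption, component (2, 1) of a row-major matrix with stride 4 sits 9 elements past p.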

The coop_mat_operand argument is used to represent additional information about input and output matrices, and also about the operation being performed. Following predefined values for this operand are introduced in the extension:

 CLK_COOPERATIVE_MATRIX_OPERAND_NONE
 CLK_COOPERATIVE_MATRIX_OPERAND_MATRIX_A_SIGNED
 CLK_COOPERATIVE_MATRIX_OPERAND_MATRIX_B_SIGNED
 CLK_COOPERATIVE_MATRIX_OPERAND_MATRIX_C_SIGNED
 CLK_COOPERATIVE_MATRIX_OPERAND_MATRIX_RESULT_SIGNED
 CLK_COOPERATIVE_MATRIX_OPERAND_SATURATING_ACCUMULATION
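
The sketch below shows one way these values could behave as a bit mask. The numeric values are an assumption: they mirror the Cooperative Matrix Operands mask of SPV_KHR_cooperative_matrix, which is consistent with the `i8 8` passed for CLK_COOPERATIVE_MATRIX_OPERAND_MATRIX_RESULT_SIGNED in the coop_matmul_add example, but the RFC does not state them or say that the flags may be combined. The shortened names are hypothetical.

```cpp
#include <cstdint>

// Assumed values, mirroring the SPIR-V Cooperative Matrix Operands mask.
// The RFC itself only shows `i8 8` for the RESULT_SIGNED operand.
enum CoopMatOperand : std::uint8_t {
  OPERAND_NONE = 0x0,
  OPERAND_MATRIX_A_SIGNED = 0x1,
  OPERAND_MATRIX_B_SIGNED = 0x2,
  OPERAND_MATRIX_C_SIGNED = 0x4,
  OPERAND_MATRIX_RESULT_SIGNED = 0x8,
  OPERAND_SATURATING_ACCUMULATION = 0x10,
};

// Combine two operand masks with a bitwise OR.
constexpr std::uint8_t combineOperands(std::uint8_t lhs, std::uint8_t rhs) {
  return static_cast<std::uint8_t>(lhs | rhs);
}

// Test whether a given flag is set in a mask.
constexpr bool hasOperand(std::uint8_t mask, CoopMatOperand flag) {
  return (mask & flag) != 0;
}
```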

4.5. Lowering Built-in Functions to LLVM IR

All operations are lowered to SPIR-V friendly LLVM IR builtin function calls. Again, the names of these functions have been selected to be in sync with the names found in the existing implementation for cooperative matrix in the SPIR-V backend.
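
The function names in the examples of this section share one visible pattern: `_Z`, the length of the builtin name, then the name itself. The sketch below reproduces only that prefix. How parameter types (in particular the cooperative matrix type) are encoded is not specified in this RFC, so it is deliberately left out; the helper name is hypothetical.

```cpp
#include <string>

// Produce the "_Z<length><name>" prefix seen in the IR examples.
// Parameter type encoding is intentionally omitted: the RFC does not
// specify it, so this is not a complete mangling scheme.
std::string mangledBuiltinPrefix(const std::string &name) {
  return "_Z" + std::to_string(name.size()) + name;
}
```

For instance, mangledBuiltinPrefix("__spirv_CooperativeMatrixLoadKHR") gives the 32-character name its `_Z32` prefix.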

coop_mat_load

 typedef float matA __attribute__((coop_mat(CLK_COOPERATIVE_MATRIX_SCOPE_SUBGROUP, 8, 4, CLK_COOPERATIVE_MATRIX_A)));
 matA m;
 m = coop_mat_load(A, CLK_COOPERATIVE_MATRIX_LAYOUT_ROW_MAJOR, 3);

 // results in
 %2 = call target("spirv.CooperativeMatrixKHR", float, 8, 4, 0, 0) @_Z32__spirv_CooperativeMatrixLoadKHR(ptr %1, i32 0, i32 3)
 store target("spirv.CooperativeMatrixKHR", float, 8, 4, 0, 0) %2, ptr %m

coop_mat_store

 typedef half matA __attribute__((coop_mat(CLK_COOPERATIVE_MATRIX_SCOPE_SUBGROUP, 2, 4, CLK_COOPERATIVE_MATRIX_A)));
 matA m;
 coop_mat_store(A, m, CLK_COOPERATIVE_MATRIX_LAYOUT_ROW_MAJOR, 3);

 // results in

 %1 = load target("spirv.CooperativeMatrixKHR", half, 2, 4, 0, 0), ptr %m
 call void @_Z33__spirv_CooperativeMatrixStoreKHR(ptr %2, target("spirv.CooperativeMatrixKHR", half, 2, 4, 0, 0) %1, i32 0, i32 3)

coop_matmul_add

 typedef float matA __attribute__((coop_mat(CLK_COOPERATIVE_MATRIX_SCOPE_SUBGROUP, 4, 8, CLK_COOPERATIVE_MATRIX_A)));
 typedef float matB __attribute__((coop_mat(CLK_COOPERATIVE_MATRIX_SCOPE_SUBGROUP, 8, 2, CLK_COOPERATIVE_MATRIX_B)));
 typedef float matC __attribute__((coop_mat(CLK_COOPERATIVE_MATRIX_SCOPE_SUBGROUP, 4, 2, CLK_COOPERATIVE_MATRIX_ACCUMULATOR)));
 matA a;
 matB b;
 matC c, d;
 d = coop_matmul_add(a, b, c, CLK_COOPERATIVE_MATRIX_OPERAND_MATRIX_RESULT_SIGNED);

 // results in
 %3 = call target("spirv.CooperativeMatrixKHR", float, 4, 2, 0, 2) @_Z34__spirv_CooperativeMatrixMulAddKHR(target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) %0,  target("spirv.CooperativeMatrixKHR", float, 8, 2, 0, 1) %1, target("spirv.CooperativeMatrixKHR", float, 4, 2, 0, 2) %2, i8 8)
 store target("spirv.CooperativeMatrixKHR", float, 4, 2, 0, 2) %3, ptr %d
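
For reference, the mathematical result of coop_matmul_add can be modeled on the host as D = A * B + C with A of shape MxK, B of shape KxN, and C and D of shape MxN. The sketch below is only such a model: it ignores how components are distributed across invocations and ignores the operand flags, and the HostMatrix type is a hypothetical stand-in.

```cpp
#include <cstddef>
#include <vector>

// Row-major host-side matrix; a stand-in for the distributed
// cooperative matrix, used only to model the arithmetic.
struct HostMatrix {
  std::size_t rows, cols;
  std::vector<float> data; // rows * cols elements, row-major
  float &at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
  float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Compute d = a * b + c. Returns false (leaving d untouched) when the
// shapes do not satisfy MxK, KxN, MxN.
bool matmulAdd(const HostMatrix &a, const HostMatrix &b, const HostMatrix &c,
               HostMatrix &d) {
  if (a.cols != b.rows || c.rows != a.rows || c.cols != b.cols)
    return false;
  d = c; // start from the accumulator, then add the products
  for (std::size_t i = 0; i < a.rows; ++i)
    for (std::size_t j = 0; j < b.cols; ++j)
      for (std::size_t k = 0; k < a.cols; ++k)
        d.at(i, j) += a.at(i, k) * b.at(k, j);
  return true;
}
```

With the shapes from the example above (4x8, 8x2, 4x2), the shape check passes and d has shape 4x2.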

4.6. Operators

The supported operators on a cooperative matrix type are the arithmetic binary operators add (+), subtract (-), multiply (*), and divide (/), along with the arithmetic unary operator negate (-). Each operation is performed component-wise on operands of identical type. The multiply operator (*) can, however, also be applied to a cooperative matrix and a scalar whose type matches the component type of the matrix. Each operation is parsed and lowered to LLVM IR as a built-in function call.

Example:

The example code snippet below shows how supported operators on a “cooperative matrix” type are lowered into LLVM-IR. We again use the same naming strategy as earlier.

 typedef float matA __attribute__((coop_mat(CLK_COOPERATIVE_MATRIX_SCOPE_SUBGROUP, 4, 8, CLK_COOPERATIVE_MATRIX_A)));
 matA a;
 matA c = a * 4;
 c = -c;
 c = c + c;

 // Results in 
 %2 = call target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) @_Z34__spirv_CooperativeMatrixScalarMul(target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) %1, i32 4)
 store target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) %2, ptr %c
 %4 = call target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) @_Z34__spirv_CooperativeMatrixScalarNeg(target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) %3)
 store target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) %4, ptr %c
 %6 = load target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0), ptr %c
 %7 = call target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) @_Z34__spirv_CooperativeMatrixBinaryAdd(target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) %5, target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) %6)
 store target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) %7, ptr %c
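
The component-wise meaning of these operators can be modeled on the host with a flat list of components. This is a sketch of the arithmetic only, using hypothetical helper names; it says nothing about how the prototype distributes components across invocations.

```cpp
#include <cstddef>
#include <vector>

// Flat list of matrix components; layout is irrelevant for
// component-wise operations.
using Components = std::vector<float>;

// c = x + y, component by component. Assumes x and y have the same
// size; y.at() throws std::out_of_range if y is shorter.
Components add(const Components &x, const Components &y) {
  Components out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    out[i] = x[i] + y.at(i);
  return out;
}

// c = -x, component by component.
Components negate(const Components &x) {
  Components out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    out[i] = -x[i];
  return out;
}

// c = x * s, where s has the component type of the matrix.
Components scale(const Components &x, float s) {
  Components out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    out[i] = x[i] * s;
  return out;
}
```

Applying scale, negate, and add in that order mirrors the three statements of the example above.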

  5. Next steps

Submission of a patch that covers the following:

  1. Extension of the existing clang::MatrixType class to represent cooperative matrices. We would welcome feedback from the community on alternative approaches, or on whether a new class would be preferred.

  2. Lowering to the LLVM-IR utilizing the Target Extension Type.

  3. Introduction of OpenCL CTS tests and LIT tests to support this implementation.

The implementation will be introduced and managed by multiple members of the Khronos OpenCL work group, including, but not limited to, Imagination Technologies and Qualcomm Inc.

Thanks

Thanks for the RFC! Some initial questions:

The SPV_KHR_cooperative_matrix extension provides some functionality I don’t see exposed or discussed in this proposal, in particular:

  • access to the components of an individual work-item,
  • the number of components accessible by an individual work-item (i.e., OpCooperativeMatrixLengthKHR), and
  • conversions between cooperative matrix types with different element types.

Are these outside of this extension proposal? If so, perhaps it’s still worth briefly discussing them to ensure current design decisions don’t block future efforts of bringing all of the SPV_KHR_cooperative_matrix functionality to OpenCL C.

I suppose you have some sort of prototype already. Did you see if the attribute-driven approach works well enough for diagnosing ill-formed programs? For example, I’d expect clang to be able to diagnose incompatible matrix argument types in a coop_matmul_add call.

Hi @svenvh

Thanks so much for your feedback. You are correct: a few items are missing from this extension, including the number of components accessible by an individual work-item. This proposal aims to add initial support for the OpenCL C cooperative matrix extension, most importantly load, store, and matrix multiply-add. We expect more functionality to be added in subsequent iterations. All three items you listed, as well as vector-matrix conversions, are planned.

We do currently have a working prototype. It has several checks in the clang front-end that generate diagnostic messages. Some examples (not the final version, just to give a flavor):

def err_coop_matrix_useA: Error<
  "argument of cooperative matrix must be CLK_COOPERATIVE_MATRIX_A">;

def err_coop_matrix_useB: Error<
  "argument of cooperative matrix must be CLK_COOPERATIVE_MATRIX_B">;

def err_coop_matrix_useACC: Error<
  "argument of cooperative matrix must be CLK_COOPERATIVE_MATRIX_ACCUMULATOR">;

def err_coop_matrix_element_type : Error<
  "inconsistent cooperative matrix element type">;

def err_coop_matrix_row_or_col_mismatch : Error<
  "mismatch of cooperative matrix row or column">;

def err_coop_matrix_use_type : Error<
  "inconsistent cooperative matrix use type">;

def err_unsupported_coopmat_binary_operator : Error<
  "unsupported cooperative matrix binary operator">;

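To illustrate how checks like these might apply to a coop_matmul_add call, here is a minimal standalone sketch. It is not the prototype's Sema code: the types and function are hypothetical, and only the use and shape checks are modeled, since the thread does not say which builtins the element-type diagnostic applies to.

```cpp
#include <string>

// Hypothetical, simplified view of a cooperative matrix type.
enum class Use { A, B, Accumulator };
struct CoopMatTy {
  std::string elem; // component type name, e.g. "float"
  unsigned rows, cols;
  Use use;
};

// Validate the matrix arguments of coop_matmul_add(a, b, c, ...).
// Returns an empty string when well-formed, otherwise the text of the
// first diagnostic that applies.
std::string checkMatmulAdd(const CoopMatTy &a, const CoopMatTy &b,
                           const CoopMatTy &c) {
  if (a.use != Use::A)
    return "argument of cooperative matrix must be CLK_COOPERATIVE_MATRIX_A";
  if (b.use != Use::B)
    return "argument of cooperative matrix must be CLK_COOPERATIVE_MATRIX_B";
  if (c.use != Use::Accumulator)
    return "argument of cooperative matrix must be "
           "CLK_COOPERATIVE_MATRIX_ACCUMULATOR";
  // Shapes must be MxK, KxN, MxN.
  if (a.cols != b.rows || c.rows != a.rows || c.cols != b.cols)
    return "mismatch of cooperative matrix row or column";
  return "";
}
```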

Thanks again.

Sincerely

Thank you for the proposal! In general LGTM, because this path already works and is supported by the SPIR-V backend and the SPIR-V to LLVM IR translator. Still, let me ask some questions and make a few recommendations:

===

// Results in
%2 = call target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) @_Z34__spirv_CooperativeMatrixScalarMul(target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) %1, i32 4)
store target("spirv.CooperativeMatrixKHR", float, 4, 8, 0, 0) %2, ptr %c

Slight correction: scope must be first among the literal operands.

===

CLK_COOPERATIVE_MATRIX_SCOPE_SUBGROUP

At the moment this value is not defined in the spec (unless I’m missing something). Can we re-use OpenCL’s subgroup scope?

===

@_Z34__spirv_CooperativeMatrixMulAddKHR(

If we go with the SPIR-V friendly approach, I suggest spelling out the mangling rules explicitly (or giving a reference to the appropriate documentation). As of right now it’s not obvious how the cooperative matrix type should be encoded in overloads.

===

coop_mat_load
coop_mat_store

coop_matmul_add

A question about these examples: as I understand it, a declaration like “matA a;” should implicitly perform matrix initialization, i.e. OpCompositeConstruct in SPIR-V, but I don’t see it in the LLVM IR. Does the OpenCL spec cover implicit initialization? And is there explicit initialization of a matrix? (Both would be good to add for the accumulator matrix.)

===
From what I see, element-wise operations are not supported in OpenCL and there are no plans for them, right? (By element-wise I mean: get the slice of a matrix owned by an invocation, take an element at an index from 0 to SLICE_SIZE, and load/store/change it; in SPIR-V this is AccessChain + load/store + modification.)

===

__spirv_CooperativeMatrixScalarMul

I like this more than the translator’s workaround, thanks! But can we rename it to something like

__spirv_CooperativeMatrixKHRElementWiseMulOp or __spirv_CooperativeMatrixKHRMulOp

?

Also applicable to other element-wise ops.

===
Slightly out of scope for this RFC (again, SPIR-V friendly LLVM IR just works), but how do you feel about adding appropriate matrix intrinsics to LLVM? It would help to use the SPV_KHR_cooperative_matrix extension in compilation paths other than OpenCL (read: MLIR). It is also much easier to register such intrinsics with various optimization passes. For example, if I remember correctly, depending on how the accumulator is produced and where it is placed in the CFG, SROA + mem2reg might have a hard time putting the matrix initializer into a register, leaving an alloca of a matrix, which without special handling will tank performance. Handling this by checking builtin names seems workaround-like, whereas taking note of an intrinsic feels more legitimate :)

Another benefit of intrinsics is that we can more easily lower them to proper hardware instructions after an LLVM => SPIR-V => LLVM round trip.

It’s also easy to extend them to, for example, introduce type interpretation metadata (for OCL floats for example).

UPD: though the problem with intrinsics would probably be the need to define the term ‘invocation’ in the LangRef.

@YuriPlyakhin @dkhaldi FYI

Hi @MrSidims

Thanks so much for your comments. You have correctly identified several gaps in the design. I seem to have pruned the document a bit more than necessary. I will add the necessary changes in a few days.

We have also been working on an implementation so that’s the cause for delay.

Thanks for your patience.

Sincerely