Sparse Compiler and GPU Code-Generation

It is time to continue the older sparse compiler thread into a new topical thread.

Although the MLIR sparse compiler is meant as a retargetable tool to exploit sparsity, most development so far has focused on generating sparse code that runs on a CPU. Even though CPUs are less commonly used for ML workloads, the ability to easily exploit unstructured sparsity still makes this a viable approach for accelerating sparse problems with very high sparsity.

Recently, however, we started looking into accelerating the generated sparse code for GPUs as well, with a new focus on exploiting structured sparsity (for example, block sparsity and 2:4 sparsity). To this end, we started to develop a prototype GPU code generator. It is extremely basic, uses a very simple memory-passing scheme between host and device, and does not yield much performance gain yet. But we hope this is the first step towards a much better GPU code generator, allowing for hybrid execution, with unstructured sparse code running on the CPU and structured sparse code accelerated on the GPU.
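As a quick illustration of the structured-sparsity target mentioned above: 2:4 sparsity means that every group of four consecutive values contains at most two nonzeros, which is what makes the pattern hardware-friendly. A small Python sketch (the function name is mine, purely for illustration, not part of the compiler):

```python
def is_two_four_sparse(values):
    """Check that every group of 4 consecutive values has at most 2 nonzeros."""
    assert len(values) % 4 == 0
    return all(
        sum(1 for v in values[i:i + 4] if v != 0) <= 2
        for i in range(0, len(values), 4)
    )

print(is_two_four_sparse([1, 0, 0, 2, 0, 3, 0, 0]))  # True: groups have 2 and 1 nonzeros
print(is_two_four_sparse([1, 2, 3, 0, 0, 0, 0, 0]))  # False: first group has 3 nonzeros
```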

The very first step is now available (with some follow-up revisions that define a working compiler pipeline and an end-to-end example). This very primitive GPU code generator basically converts the outermost loop of the generated sparse kernel into threaded code. For example, a kernel like this:

func.func @matvec(%A: tensor<?x?xf64, #CSR>, 
                  %x: tensor<?xf64>,
                  %y_in: tensor<?xf64>) -> tensor<?xf64> {
  %y_out = linalg.matvec
      ins(%A, %x: tensor<?x?xf64, #CSR>, tensor<?xf64>)
      outs(%y_in: tensor<?xf64>) -> tensor<?xf64>
  return %y_out : tensor<?xf64>
}

is sparsified and then made parallel as follows, where the parameters are host-registered buffers that contain the sparse matrix in CSR format.

gpu.module @sparsekernels {
    gpu.func @kernel(%arg0: index,
                     %arg1: memref<?xf64>,
                     %arg2: memref<?xindex>,
                     %arg3: memref<?xindex>,
                     %arg4: memref<?xf64>,
                     %arg5: memref<?xf64>) kernel {
      %c1 = arith.constant 1 : index
      %0 = gpu.block_id  x
      %1 = gpu.block_dim  x 
      %2 = gpu.thread_id  x 
      %3 = gpu.grid_dim  x
      %4 = arith.muli %0, %1 : index
      %5 = arith.addi %4, %2 : index
      %6 = arith.muli %1, %3 : index
      scf.for %arg6 = %5 to %arg0 step %6 {
        %7 = memref.load %arg1[%arg6] : memref<?xf64>
        %8 = memref.load %arg2[%arg6] : memref<?xindex>
        %9 = arith.addi %arg6, %c1 : index
        %10 = memref.load %arg2[%9] : memref<?xindex>
        %11 = scf.for %arg7 = %8 to %10 step %c1 iter_args(%arg8 = %7) -> (f64) {
          %12 = memref.load %arg3[%arg7] : memref<?xindex>
          %13 = memref.load %arg4[%arg7] : memref<?xf64>
          %14 = memref.load %arg5[%12] : memref<?xf64>
          %15 = arith.mulf %13, %14 : f64
          %16 = arith.addf %arg8, %15 : f64
          scf.yield %16 : f64
        } {"Emitted from" = "linalg.generic"}
        memref.store %11, %arg1[%arg6] : memref<?xf64>
      }
      gpu.return
    }
}

And, yes, before the experts chime in, this is typically not the best way to make SpMV parallel :slight_smile:
But all the basic building blocks are now in place to further develop GPU code generation into something with higher performance, especially when focusing on structured sparsity.
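To make the generated scheme above concrete, here is a plain-Python sketch of the same row-parallel CSR SpMV, where each simulated "thread" processes rows in a grid-stride pattern. All names and the thread simulation are mine, purely for illustration:

```python
def csr_spmv_row_parallel(n_rows, positions, coordinates, values, x, y,
                          thread_id, num_threads):
    """One simulated GPU thread: grid-stride loop over the rows of a CSR matrix.

    positions   - row pointers (length n_rows + 1)
    coordinates - column indices of the nonzeros
    values      - nonzero values
    """
    for i in range(thread_id, n_rows, num_threads):
        acc = y[i]  # start from the incoming y value, as in linalg.matvec
        for j in range(positions[i], positions[i + 1]):
            acc += values[j] * x[coordinates[j]]
        y[i] = acc

# The 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form:
positions, coordinates, values = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
x, y = [1.0, 1.0, 1.0], [0.0, 0.0]
for t in range(4):  # simulate 4 "threads"
    csr_spmv_row_parallel(2, positions, coordinates, values, x, y, t, 4)
print(y)  # [3.0, 3.0]
```

Since each row is owned by exactly one thread, no synchronization is needed on the output vector, which is also why this row-parallel scheme struggles when row lengths are very imbalanced.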

Your insights and ideas are welcomed here!
Stay tuned for updates!


Some more progress!

The sparse compiler now has two prototype strategies for generating CUDA:

  1. CUDA codegen: this converts sparsified code to CUDA threads
  2. CUDA libgen: this converts pre-sparsified code to cuSPARSE library calls

An example of the former was shown above. An example of the latter is illustrated below. (Note that I have extended the GPU dialect with cuSPARSE support; I will send that out for review shortly, since it may trigger some discussion on the proper way to represent this and on whether async tokens are required. The basic mechanism, however, is ready to be deployed!)

 func.func @matvec(%A: tensor<?x?xf64, #SortedCOO>,
                    %x: tensor<?xf64>,
                    %y_in: tensor<?xf64>) -> tensor<?xf64> {
    %y_out = linalg.matvec
      ins(%A, %x: tensor<?x?xf64, #SortedCOO>, tensor<?xf64>)
      outs(%y_in: tensor<?xf64>) -> tensor<?xf64>
    return %y_out : tensor<?xf64>
  }

lowers directly into cuSPARSE:

    %16 = gpu.create_sparse_env
    %17 = gpu.create_coo %1, %2, %dim, %memref, %memref_2, %memref_5 : memref<?xindex>, memref<?xindex>, memref<?xf64>
    %18 = gpu.create_dn_vec %memref_8, %2 : memref<?xf64>
    %19 = gpu.create_dn_vec %memref_11, %1 : memref<?xf64>
    %20 = gpu.spmv_buffer_size %16, %17, %18, %19
    %21 = gpu.wait async
    %memref_13, %asyncToken_14 = gpu.alloc async [%21] (%20) : memref<?xi8>
    gpu.wait [%asyncToken_14]
    gpu.spmv %16, %17, %18, %19, %memref_13 : memref<?xi8>
    gpu.destroy_sp_mat %17
    gpu.destroy_dn_vec %18
    gpu.destroy_dn_vec %19
    gpu.destroy_sparse_env %16

The first revision with the gpu dialect changes and mlir_cuda_runtime wrappers to cuSPARSE is out for review in ⚙ D150152 [gpu][sparse] add gpu ops for sparse matrix computations (I could use some help figuring out the CMake syntax for including the cuSPARSE headers and library).

I will send out a follow-up CL that uses this from the sparse compiler pipeline shortly so that the changes fall into the right context.

This revision ⚙ D150170 [mlir][sparse][gpu] first implementation of the GPU libgen approach shows how the actual libgen path works in the sparse compiler pipeline. Currently only for matvec on COO with F64 data, but generalizing this to other kernels and data types is straightforward once the initial larger revisions have been accepted.

Lastly, I plan to add an end-to-end test to the stack of changes. After that, the code will be generalized and refined, followed by actual performance analysis of course.

And finally for today, one end-to-end example in ⚙ D150172 [mlir][sparse][gpu] end-to-end integration test of GPU libgen approach. As stated before, lots of refinements will follow, but I would like to get consensus on the basic approach first.