It is time to continue the older sparse compiler thread into a new topical thread.
Although the MLIR sparse compiler is obviously meant as a retargetable tool to exploit sparsity, most development so far has focused on generating sparse code that runs on a CPU. And although CPUs are rarely the first choice for ML workloads, the ability to easily exploit unstructured sparsity still makes this a viable approach for accelerating sparse problems with very high sparsity.
Recently, however, we started to look into accelerating the generated sparse code for GPUs as well, with a new focus on exploiting structured sparsity (for example, block sparsity and 2:4 sparsity). To this end, we started to develop a prototype GPU code generator. It is extremely basic, uses a very simple memory-passing scheme between host and device, and does not yield much performance gain yet. But we hope this is the first step towards a much better GPU code generator, allowing for hybrid execution, with unstructured sparse code running on the CPU and structured sparse code accelerated on the GPU.
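For readers unfamiliar with the term, 2:4 sparsity requires that every contiguous group of four values contains at most two nonzeros, which is what NVIDIA sparse tensor cores accelerate. A minimal Python sketch (the helper name is illustrative, not part of the compiler) that checks this property on a flat array:

```python
def is_24_sparse(values):
    """Check the 2:4 structured-sparsity property: every contiguous
    group of four elements contains at most two nonzeros."""
    for i in range(0, len(values), 4):
        group = values[i:i + 4]
        if sum(1 for v in group if v != 0.0) > 2:
            return False
    return True

print(is_24_sparse([1.0, 0.0, 2.0, 0.0, 0.0, 3.0, 0.0, 0.0]))  # True
print(is_24_sparse([1.0, 2.0, 3.0, 0.0]))                      # False
```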
You can find the very first step (with some follow-up revisions that define a working compiler pipeline and an end-to-end example). This very primitive GPU code generator simply converts the outermost loop of the generated sparse kernel into threaded code. For example, something like this:
func.func @matvec(%A: tensor<?x?xf64, #CSR>,
                  %x: tensor<?xf64>,
                  %y_in: tensor<?xf64>) -> tensor<?xf64> {
  %y_out = linalg.matvec
      ins(%A, %x: tensor<?x?xf64, #CSR>, tensor<?xf64>)
      outs(%y_in: tensor<?xf64>) -> tensor<?xf64>
  return %y_out : tensor<?xf64>
}
is sparsified and then made parallel as follows, where the parameters are host-registered buffers that contain the sparse matrix in CSR format.
gpu.module @sparsekernels {
  gpu.func @kernel(%arg0: index,
                   %arg1: memref<?xf64>,
                   %arg2: memref<?xindex>,
                   %arg3: memref<?xindex>,
                   %arg4: memref<?xf64>,
                   %arg5: memref<?xf64>) kernel {
    %c1 = arith.constant 1 : index
    %0 = gpu.block_id x
    %1 = gpu.block_dim x
    %2 = gpu.thread_id x
    %3 = gpu.grid_dim x
    %4 = arith.muli %0, %1 : index
    %5 = arith.addi %4, %2 : index
    %6 = arith.muli %1, %3 : index
    scf.for %arg6 = %5 to %arg0 step %6 {
      %7 = memref.load %arg1[%arg6] : memref<?xf64>
      %8 = memref.load %arg2[%arg6] : memref<?xindex>
      %9 = arith.addi %arg6, %c1 : index
      %10 = memref.load %arg2[%9] : memref<?xindex>
      %11 = scf.for %arg7 = %8 to %10 step %c1 iter_args(%arg8 = %7) -> (f64) {
        %12 = memref.load %arg3[%arg7] : memref<?xindex>
        %13 = memref.load %arg4[%arg7] : memref<?xf64>
        %14 = memref.load %arg5[%12] : memref<?xf64>
        %15 = arith.mulf %13, %14 : f64
        %16 = arith.addf %arg8, %15 : f64
        scf.yield %16 : f64
      } {"Emitted from" = "linalg.generic"}
      memref.store %11, %arg1[%arg6] : memref<?xf64>
    }
    gpu.return
  }
}
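To make the mapping concrete, here is a plain-Python sketch of what the kernel above computes: each thread walks the rows of the CSR matrix in a grid-stride pattern (the thread index and stride are passed in explicitly to simulate the GPU launch; all names are illustrative):

```python
def spmv_rows(tid, stride, num_rows, y, positions, coordinates, values, x):
    """Grid-stride CSR SpMV: thread `tid` processes rows tid, tid+stride, ...,
    mirroring the outer and inner scf.for loops in the generated gpu.func."""
    for row in range(tid, num_rows, stride):
        acc = y[row]                                 # initial value from the output vector
        lo, hi = positions[row], positions[row + 1]  # extent of this row's stored entries
        for j in range(lo, hi):                      # inner loop over nonzeros
            acc += values[j] * x[coordinates[j]]
        y[row] = acc                                 # store the dot product back

# Simulate two "threads" on the 3x3 CSR matrix [[1,0,2],[0,3,0],[4,0,5]].
positions   = [0, 2, 3, 5]
coordinates = [0, 2, 1, 0, 2]
values      = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 1.0, 1.0]
y = [0.0, 0.0, 0.0]
for tid in range(2):
    spmv_rows(tid, 2, 3, y, positions, coordinates, values, x)
print(y)  # [3.0, 3.0, 9.0]
```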
And, yes, before the experts chime in: this is typically not the best way to parallelize SpMV.
But all the basic building blocks are now in place to further develop GPU code generation into something with higher performance, especially when focusing on structured sparsity.
Your insights and ideas are welcomed here!
Stay tuned for updates!