There are a few details missing, for example with respect to the “breakdown” of the sparse data structures that is required to get the data properly back from device to host. So, why not let a compiler deal with the hairy details, especially for sparsity (which is why I have been a fierce proponent of sparse compilers since 1996).
Let’s start with the dense MV and annotate the matrix as sparse, in CSR form:
#CSR = #sparse_tensor.encoding<{
  lvlTypes = [ "dense", "compressed" ],
  posWidth = 32,
  crdWidth = 32
}>

func.func @matvecCSR(%A: tensor<?x?xf64, #CSR>,
                     %x: tensor<?xf64>,
                     %y_in: tensor<?xf64>) -> tensor<?xf64> {
  %y_out = linalg.matvec
      ins(%A, %x: tensor<?x?xf64, #CSR>, tensor<?xf64>)
      outs(%y_in: tensor<?xf64>) -> tensor<?xf64>
  return %y_out : tensor<?xf64>
}
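For comparison, the dense MV kernel that this starts from is the exact same code without the #CSR annotation on the matrix type; the sketch below is just for illustration (the name @matvecDense is mine, not from the original):

// Dense baseline: the identical kernel, only the encoding on %A is gone.
func.func @matvecDense(%A: tensor<?x?xf64>,
                       %x: tensor<?xf64>,
                       %y_in: tensor<?xf64>) -> tensor<?xf64> {
  %y_out = linalg.matvec
      ins(%A, %x: tensor<?x?xf64>, tensor<?xf64>)
      outs(%y_in: tensor<?xf64>) -> tensor<?xf64>
  return %y_out : tensor<?xf64>
}

That is the whole point of the annotation-driven approach: the kernel itself never changes, only the type of the matrix does.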
Then we invoke the sparsifier pipeline of MLIR with GPU acceleration enabled (by default, codegen targets the CPU only):
mlir-opt --sparse-compiler="enable-runtime-library=true enable-gpu-libgen gpu-triple=nvptx64-nvidia-cuda gpu-chip=sm_80 gpu-features=+ptx71" spmv.mlir
And, voilà, we get the proper sequence for SpMV (showing the IR right after the GPU part of the sparsifier, since the IR further downstream introduces a lot of implementation details that are harder to read):
func.func @matvecCSR(%arg0: tensor<?x?xf64, #sparse_tensor.encoding<{ lvlTypes = [ "dense", "compressed" ], posWidth = 32, crdWidth = 32 }>>, %arg1: tensor<?xf64>, %arg2: tensor<?xf64>) -> tensor<?xf64> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %0 = sparse_tensor.number_of_entries %arg0 : tensor<?x?xf64, #sparse_tensor.encoding<{ lvlTypes = [ "dense", "compressed" ], posWidth = 32, crdWidth = 32 }>>
  %dim = tensor.dim %arg0, %c0 : tensor<?x?xf64, #sparse_tensor.encoding<{ lvlTypes = [ "dense", "compressed" ], posWidth = 32, crdWidth = 32 }>>
  %dim_0 = tensor.dim %arg0, %c1 : tensor<?x?xf64, #sparse_tensor.encoding<{ lvlTypes = [ "dense", "compressed" ], posWidth = 32, crdWidth = 32 }>>
  %1 = sparse_tensor.positions %arg0 {level = 1 : index} : tensor<?x?xf64, #sparse_tensor.encoding<{ lvlTypes = [ "dense", "compressed" ], posWidth = 32, crdWidth = 32 }>> to memref<?xi32>
  %2 = sparse_tensor.coordinates %arg0 {level = 1 : index} : tensor<?x?xf64, #sparse_tensor.encoding<{ lvlTypes = [ "dense", "compressed" ], posWidth = 32, crdWidth = 32 }>> to memref<?xi32>
  %3 = sparse_tensor.values %arg0 : tensor<?x?xf64, #sparse_tensor.encoding<{ lvlTypes = [ "dense", "compressed" ], posWidth = 32, crdWidth = 32 }>> to memref<?xf64>
  %4 = gpu.wait async
  %dim_1 = memref.dim %1, %c0 : memref<?xi32>
  %memref, %asyncToken = gpu.alloc async [%4] (%dim_1) : memref<?xi32>
  %5 = gpu.memcpy async [%asyncToken] %memref, %1 : memref<?xi32>, memref<?xi32>
  %6 = gpu.wait async
  %dim_2 = memref.dim %2, %c0 : memref<?xi32>
  %memref_3, %asyncToken_4 = gpu.alloc async [%6] (%dim_2) : memref<?xi32>
  %7 = gpu.memcpy async [%asyncToken_4] %memref_3, %2 : memref<?xi32>, memref<?xi32>
  %8 = gpu.wait async
  %dim_5 = memref.dim %3, %c0 : memref<?xf64>
  %memref_6, %asyncToken_7 = gpu.alloc async [%8] (%dim_5) : memref<?xf64>
  %9 = gpu.memcpy async [%asyncToken_7] %memref_6, %3 : memref<?xf64>, memref<?xf64>
  %10 = bufferization.to_memref %arg1 : memref<?xf64>
  %11 = gpu.wait async
  %dim_8 = memref.dim %10, %c0 : memref<?xf64>
  %memref_9, %asyncToken_10 = gpu.alloc async [%11] (%dim_8) : memref<?xf64>
  %12 = gpu.memcpy async [%asyncToken_10] %memref_9, %10 : memref<?xf64>, memref<?xf64>
  %13 = bufferization.to_memref %arg2 : memref<?xf64>
  %14 = gpu.wait async
  %dim_11 = memref.dim %13, %c0 : memref<?xf64>
  %memref_12, %asyncToken_13 = gpu.alloc async [%14] (%dim_11) : memref<?xf64>
  %15 = gpu.memcpy async [%asyncToken_13] %memref_12, %13 : memref<?xf64>, memref<?xf64>
  gpu.wait [%5, %7, %9, %12, %15]
  %16 = gpu.wait async
  %spmat, %asyncToken_14 = gpu.create_csr async [%16] %dim, %dim_0, %0, %memref, %memref_3, %memref_6 : memref<?xi32>, memref<?xi32>, memref<?xf64>
  %dnTensor, %asyncToken_15 = gpu.create_dn_tensor async [%asyncToken_14] %memref_9, %dim_0 : index into memref<?xf64>
  %dnTensor_16, %asyncToken_17 = gpu.create_dn_tensor async [%asyncToken_15] %memref_12, %dim : index into memref<?xf64>
  %bufferSz, %asyncToken_18 = gpu.spmv_buffer_size async [%asyncToken_17] %spmat, %dnTensor, %dnTensor_16 into f64
  %memref_19, %asyncToken_20 = gpu.alloc async [%asyncToken_18] (%bufferSz) : memref<?xi8>
  %17 = gpu.spmv async [%asyncToken_20] %spmat, %dnTensor, %dnTensor_16, %memref_19 : memref<?xi8> into f64
  %18 = gpu.destroy_sp_mat async [%17] %spmat
  %19 = gpu.destroy_dn_tensor async [%18] %dnTensor
  %20 = gpu.destroy_dn_tensor async [%19] %dnTensor_16
  %21 = gpu.dealloc async [%20] %memref : memref<?xi32>
  %22 = gpu.dealloc async [%21] %memref_3 : memref<?xi32>
  %23 = gpu.dealloc async [%22] %memref_6 : memref<?xf64>
  %24 = gpu.dealloc async [%23] %memref_19 : memref<?xi8>
  %25 = gpu.dealloc async [%24] %memref_9 : memref<?xf64>
  %26 = gpu.memcpy async [%25] %13, %memref_12 : memref<?xf64>, memref<?xf64>
  %27 = gpu.dealloc async [%26] %memref_12 : memref<?xf64>
  gpu.wait [%27]
  %28 = bufferization.to_tensor %13 : memref<?xf64>
  return %28 : tensor<?xf64>
}
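To actually exercise the kernel end to end, a small driver can be placed next to @matvecCSR in spmv.mlir. The following is only a minimal sketch under my own assumptions (the 4x4 example data, the @main wrapper, and the use of sparse_tensor.convert, tensor.empty/linalg.fill, vector.transfer_read, and bufferization.dealloc_tensor for setup, output, and cleanup are mine, not part of the original kernel or compiler output):

// Hypothetical driver: builds a small CSR matrix from a dense constant,
// runs the kernel, prints the result, and releases the sparse storage.
func.func @main() {
  %c0 = arith.constant 0 : index
  %c4 = arith.constant 4 : index
  %f0 = arith.constant 0.0 : f64

  // A 4x4 example matrix, cast to a dynamic shape and converted to CSR.
  %d = arith.constant dense<[ [ 1.0, 0.0, 0.0, 2.0 ],
                              [ 0.0, 0.0, 3.0, 0.0 ],
                              [ 0.0, 4.0, 0.0, 0.0 ],
                              [ 5.0, 0.0, 0.0, 6.0 ] ]> : tensor<4x4xf64>
  %dd = tensor.cast %d : tensor<4x4xf64> to tensor<?x?xf64>
  %A = sparse_tensor.convert %dd : tensor<?x?xf64> to tensor<?x?xf64, #CSR>

  // Dense input vector x and a zero-initialized output vector y.
  %xs = arith.constant dense<[ 1.0, 2.0, 3.0, 4.0 ]> : tensor<4xf64>
  %x = tensor.cast %xs : tensor<4xf64> to tensor<?xf64>
  %e = tensor.empty(%c4) : tensor<?xf64>
  %y_in = linalg.fill ins(%f0 : f64) outs(%e : tensor<?xf64>) -> tensor<?xf64>

  // y = A * x.
  %y = call @matvecCSR(%A, %x, %y_in)
     : (tensor<?x?xf64, #CSR>, tensor<?xf64>, tensor<?xf64>) -> tensor<?xf64>

  // Print the result; for this example data the expected vector is ( 9, 9, 8, 29 ).
  %v = vector.transfer_read %y[%c0], %f0 : tensor<?xf64>, vector<4xf64>
  vector.print %v : vector<4xf64>

  // Release the sparse tensor storage.
  bufferization.dealloc_tensor %A : tensor<?x?xf64, #CSR>
  return
}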
I hope this helps!