Some more progress!
The sparse compiler now has two prototype strategies for generating CUDA:
- CUDA codegen: this converts sparsified code to CUDA threads
- CUDA libgen: this converts pre-sparsified code to cuSPARSE library calls
An example of the former was shown above. An example of the latter is illustrated below. (Note that I have extended the GPU dialect with cuSPARSE support; I will send that out for review shortly, since it may trigger some discussion on the proper way to represent these operations and on whether async tokens are required. The basic mechanism, however, is ready to be deployed!)
```mlir
func.func @matvec(%A: tensor<?x?xf64, #SortedCOO>,
                  %x: tensor<?xf64>,
                  %y_in: tensor<?xf64>) -> tensor<?xf64> {
  %y_out = linalg.matvec
      ins(%A, %x: tensor<?x?xf64, #SortedCOO>, tensor<?xf64>)
      outs(%y_in: tensor<?xf64>) -> tensor<?xf64>
  return %y_out : tensor<?xf64>
}
```
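For completeness, the example assumes a `#SortedCOO` encoding attribute on the sparse tensor type. A plausible definition is sketched below (the exact level-type spelling depends on the current `sparse_tensor` dialect syntax, so treat this as illustrative rather than definitive):

```mlir
// Sorted coordinate format: a compressed outer level whose coordinates
// are not unique, followed by a singleton inner level.
#SortedCOO = #sparse_tensor.encoding<{
  dimLevelType = [ "compressed-nu", "singleton" ]
}>
```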
lowers directly into cuSPARSE operations in the GPU dialect:
```mlir
%16 = gpu.create_sparse_env
%17 = gpu.create_coo %1, %2, %dim, %memref, %memref_2, %memref_5 : memref<?xindex>, memref<?xindex>, memref<?xf64>
%18 = gpu.create_dn_vec %memref_8, %2 : memref<?xf64>
%19 = gpu.create_dn_vec %memref_11, %1 : memref<?xf64>
%20 = gpu.spmv_buffer_size %16, %17, %18, %19
%21 = gpu.wait async
%memref_13, %asyncToken_14 = gpu.alloc async [%21] (%20) : memref<?xi8>
gpu.wait [%asyncToken_14]
gpu.spmv %16, %17, %18, %19, %memref_13 : memref<?xi8>
gpu.destroy_sp_mat %17
gpu.destroy_dn_vec %18
gpu.destroy_dn_vec %19
gpu.destroy_sparse_env %16
```