MLIR Sparse Compiler Progress

Now that sparse tensor types are first-class citizens and the sparse compiler is really taking shape, see for example this previous thread and the Sparse Compiler pre-print paper, it is time to make sure all other compiler optimizations compose well with sparse tensors, since proper composition is one of the major themes in the Composable and Modular Code Generation in MLIR pre-print paper.

Mostly, the composition should be completely transparent (i.e., kernels on dense or sparse tensors take exactly the same path through the compiler). Revision D119971 demonstrates such trivial compositions, since it places a set of linalg rewriting rules upstream of the sparse compiler with only minimum changes.

In some cases, however, optimizations only make sense in the context of sparse tensors, even though the rewriting would be applicable to both dense and sparse tensors alike. Revision D120429 is a first example of such an optimization that fuses two subsequent kernels by distributing a multiplication over additions (relying on arithmetic equalities that may not always hold for floating-point computations). This rewriting would be undesirable in the dense case, but may make a lot of sense for sparse tensor operands, since the resulting kernel has a potential lower asymptotic complexity.

For example, expressing SDDMM in PyTACO as follows, where S and X are sparse, and A, B, dense

X[i,j] = S[i,j] * A[i,k] * B[k,j]

yields a single Linalg op that computes the expression sum(k, S[i,j] * A[i,k] * B[k,j]). However, when expressed as follows

X[i,j] = A[i,k] * B[k,j] * S[i,j]

the semantics dictate that subexpression sum(k, A[i,k] * B[k,j]) * S[i,j] is expressed using two Linalg ops.

T[i,j] = A[i,k] * B[k,j]
X[i,j] = T[i,j] * S[i,j]

Without the fusion described above, the first kernel would compute the full dense matrix times matrix multiplication prior to doing the element-wise sampling. This would have a completely wrong asymptotic complexity. By fusing the ops, we again see that we only need to compute dot products that are not nullified.

5 Likes

Revision D124502 makes some improvements to our internal COO format, which is central to reading in sparse tensors as well as converting between sparse or dense tensors. In particular, I reduced the memory footprint of each element and reduced the (re)allocation overhead using a shared index pool that can be reserved in advance (but even when the size is not known in advance, reallocating the pool is better than allocating many small vectors). The results are very pleasant, for example, reading in sparse matrix SK-2005, with dimensions 50,636,154 x 50,636,154 and 1,949,412,601 nonzero elements improves as follows (time in ms), organized by reading, sorting and packing, or about 3.5x faster overall.

SK-2005   before         after        speedup
---------------------------------------------
read       305,086.65    180,318.12    1.69x
sort     2,836,096.23    510,492.87    5.56x
pack       364,485.67    312,009.96    1.17x
  ---------------------------------------------
TOTAL    3,505,668.56  1,002,820.95    3.50x
2 Likes

Thanks to the excellent contributions of @jim22k, the MLIR sparse compiler now provides support for arbitrary “semi-ring” operations. For example, suppose you want to implement a “sign x” operator, that has no direct equivalent in an MLIR op (yielding +1 for x>0, -1 for x<0, including for +/-Inf, but with the special cases of +/-NaN and +/-0.0 staying the same).

Something like that can be expressed as

              %z = arith.constant 0.0 : f64
              %1 = arith.cmpf one, %s, %z : f64
              %2 = arith.uitofp %1 : i1 to f64
              %3 = math.copysign %2, %s : f64
              %4 = arith.cmpf uno, %s, %s : f64
              %sign = arith.select %4, %s, %3 : f64

but the sparse compiler would originally “choke” on such a block of code (since most rules deal with the standard semi-ring operations around + and *). However, @jim22 came to the rescue and added a construct for general unary and binary operations where the programmer can define what actions to take in the absence or presence of “stored elements” (typically the nonzeros).

This allows us to express the sign operator as follows, and go through all sparse code generation as usual, eventually yielding code that only operates on the stored elements

    %xin = sparse_tensor.init [%d] : tensor<?xf64, #SparseVector>
    %0 = linalg.generic #trait_op
      ins(%arg0: tensor<?xf64, #SparseVector>)
      outs(%xin: tensor<?xf64, #SparseVector>) {
      ^bb0(%a: f64, %x: f64) :
        %result = sparse_tensor.unary %a : f64 to f64
          present={
            ^bb1(%s: f64):
              %z = arith.constant 0.0 : f64
              %1 = arith.cmpf one, %s, %z : f64
              %2 = arith.uitofp %1 : i1 to f64
              %3 = math.copysign %2, %s : f64
              %4 = arith.cmpf uno, %s, %s : f64
              %5 = arith.select %4, %s, %3 : f64
              sparse_tensor.yield %5 : f64
          }
          absent={}
        linalg.yield %result : f64
    } -> tensor<?xf64, #SparseVector>

This contribution makes a major step towards supporting general sparse array languages (also see e.g. this recent paper by Henry et al.).

You can check out the new operations in the SparseTensor dialect.

Thanks @jim22k !

1 Like

Thanks @aartbik for the kind words and for the collaboration on this. Your feedback made the syntax much more elegant than my original proposal.

There are still a few more pieces needed (in the works), but the sparse_tensor dialect is getting close to having all the functionality needed to write an implementation of GraphBLAS in MLIR, which is my ultimate goal.

2 Likes

Heads up!

Note that @matthias-springer unified the various tensor initialization operations that had proliferated over several dialects. For the sparse tensor dialect, in particular, that means that a linalg kernel that yields a sparse output, such as

%xv = sparse_tensor.init [%d] : tensor<?xf64, #SparseVector>
%0 = linalg.generic #trait_vec_op
     ins(...)
     outs(%xv: tensor<?xf64, #SparseVector>) {
      ....
} -> tensor<?xf64, #SparseVector>

should now be expressed as

%xv = bufferization.alloc_tensor(%d) : tensor<?xf64, #SparseVector>
%0 = linalg.generic #trait_vec_op
     ins(...)
     outs(%xv: tensor<?xf64, #SparseVector>) {
      ....
} -> tensor<?xf64, #SparseVector>

Hopefully the alloc_tensor operation in the bufferization dialect more clearly reflects the intention to materialize an uninitialized sparse tensor with given shape (either static or dynamic), providing an anchor to a properly typed but uninitialized sparse tensor into the output clause of the subsequent operation that yields a sparse tensor as the result. More importantly, all analysis and rewriting logic related to initializing the buffer of tensors is now shared between dense and sparse tensors, reducing code duplication and the potential of forgetting to fix bugs along one of the paths.

Thanks Matthias for making this happen. Out of tree projects depending on sparse compilation of MLIR will have to make the change illustrated above.

1 Like

I am super thrilled to announce that the paper on this topic (an earlier version was made available as pre-print on ArXiv) has been accepted by ACM Transactions on Architecture and Code Optimization as journal publication, and is available through Open Access.

  • Aart J.C. Bik, Penporn Koanantakool, Tatiana Shpeisman, Nicolas Vasilache, Bixia Zheng, and Fredrik Kjolstad. Compiler Support for Sparse Tensor Computations in MLIR. ACM Transactions on Architecture and Code Optimization, June, 2022. See DOI 10.1145/3544559.
4 Likes

The sparse compiler team at is full speed ahead on replacing calls into a runtime support library with actual code generation. Design-wise, this means that the output of “sparsification” (the process of lowering sparsity-agnostic operations on sparse tensor types into a sparsity-aware operations) is no longer fed into the “conversion” pass (which lowers the sparse tensor types to opaque pointers and sparse ops into library calls), but into a new “codegen” pass (which yields concrete buffers and code).

Clearly, reducing IR opacity will ultimately enable many more optimizations downstream.

To give a sneak peek into the upcoming differences, consider a “sparsified” output like

#CSR = #sparse_tensor.encoding<{
  dimLevelType = [ "dense", "compressed" ]
}>

func.func @sparse_values_csr(%arg0: tensor<?x?xf64, #CSR>)
       -> memref<?xf64> {
  %0 = sparse_tensor.values %arg0 : tensor<?x?xf64, #CSR>
       to memref<?xf64>
  return %0 : memref<?xf64>
}

Then “conversion”, the current reference implementation, yields the following. Although our library has high performance, this kind of IR clearly disables most downstream optimizations.

func.func @sparse_values_csr(%arg0: !llvm.ptr<i8>) -> memref<?xf64> {
  %0 = call @sparseValuesF64(%arg0) : (!llvm.ptr<i8>) -> memref<?xf64>
  return %0 : memref<?xf64>
}

In constrast, “codegen” would ultimately expose the storage as proper IR to downstream passes.

func.func @sparse_values_csr(%arg0: memref<2xindex>,
                             %arg1: memref<?xindex>, 
                             %arg2: memref<?xindex>,
                             %arg3: memref<?xf64>>) -> memref<?xf64> {
  return %arg3 : memref<?xf64>
}

Although not much is really happening in this example, hopefully it already illustrates the potential reduction in IR opacity. When more and more operations are lowered to actual code and buffers, more opportunities for downstream optimizations will arise as well!

Stay tuned for some more progress updates in this thread!

4 Likes

Just as a heads up, you may have noticed that revision D136183 completely removed vectorization support from the sparse compiler.

Wait! What?! Can it be that the author of the The Software Vectorization Handbook no longer cares about SIMD?!

Rest assured. Quite the opposite! But we want to do vectorization of sparse code the right way, through progressive lowering and clean separation of concerns. So expect to see this functionality restored soon again, but then done the right way (so in a way, this too, is a spin-off result of our [RFC] Structured Codegen Beyond Rectangular Arrays).

@dcaballe @nicolasvasilache

2 Likes

I will give a very brief progress update on the sparse compiler project here at Google (Project Updates Track) during the Nov 10 MLIR summit at the 2022 LLVM Developers’ Meeting. I hope to see you there!

3 Likes

Good news! Today I implemented a new vectorizer for the loops generated by the sparse compiler as a separate pass, which brings back all vectorization functionality previously removed (including ArmSVE support), but in a much cleaner form. Still not the final abstraction level we want, and not the general vectorizer we want either, but a much better stepping stone for incremental future improvements. And I believe this is the 8th vectorizer I wrote in my career, but who is counting :wink:

scf.for %arg3 = %c0 to %c1024 step %c32 {
  %3 = vector.load %0[%arg3] : memref<?xf32>, vector<32xf32>
  %4 = vector.broadcast %arg1 : f32 to vector<32xf32>
  %5 = arith.mulf %3, %4 : vector<32xf32>
  vector.store %5, %1[%arg3] : memref<1024xf32>, vector<32xf32>
}
2 Likes

Revision ⚙ D138236 [mlir][sparse] introduce vectorization pass for sparse loops brings back a separate pass with pretty much all prior SIMD support. There are a few last i’s to dot, such as ARMSVE scaling and dealing with indices inside SIMD code (viz. a[i]=i) but that will be taken care off by a few follow up revisions.

For our good friends at ARM, I also dotted a few i’s in the revision to make sure scalalabe vectors (ArmSVE) keep working! Here is a reduction example.

    %2 = memref.load %1[] : memref<f32>
    %3 = vector.vscale
    %4 = arith.muli %3, %c4 : index
    %5 = vector.insertelement %2, %cst[%c0 : index] : vector<[4]xf32>
    %6 = scf.for %arg3 = %c0 to %c1024 step %4 iter_args(%arg4 = %5) -> (vector<[4]xf32>) {
      %9 = affine.min #map2(%c1024, %arg3)[%4]
      %10 = vector.create_mask %9 : vector<[4]xi1>
      %11 = vector.maskedload %0[%arg3], %10, %cst : memref<?xf32>, vector<[4]xi1>, vector<[4]xf32> into vector<[4]xf32>
      %12 = arith.addf %arg4, %11 : vector<[4]xf32>
      %13 = arith.select %10, %12, %arg4 : vector<[4]xi1>, vector<[4]xf32>
      scf.yield %13 : vector<[4]xf32>
    }
    %7 = vector.reduction <add>, %6 : vector<[4]xf32> into f32
    memref.store %7, %1[] : memref<f32>
2 Likes

Great news, thank you for working on this Aart! :slight_smile: I guess that we should be able to revisit ⚙ D121304 [mlir][sparse][ArmSVE] Add sparse integration tests for ArmSVE after your patch lands?

-Andrzej

1 Like

Absolutely!

The revision D139581 provides all the required machinery to (re)introduce sparse vectorization to the ‘sparse compiler pipeline’. To enable, pass in a desired vector length and some optimization switches to the pipeline, as follows.

--sparse-compiler="vl=16 reassociate-fp-reductions enable-index-optimizations"

Enabling ArmSVE automatically enables VLA (scalable vectors), since this is currently the only target architecture this applies to.

--sparse-compiler="vl=4 enable-arm-sve reassociate-fp-reductions enable-index-optimizations"

2 Likes