Sparse Representation

Greetings all. First time poster, so please - be gentle.

I want to add a sparse tensor representation into MLIR, but I am uncertain the best method to pursue. So I would like to obtain the wisdom of the hive mind. Allow me to explain what I feel are possibilities.

Modify memref with sparse markup

One of my goals is to have the sparse representation work seamlessly with Linalg, etc. just like a memref. For example, when a memref is defined, add an extra markup that would identify a specific dimension as sparse:

%0 = alloc() : memref<16x32sxf32>

Note the “s” after the 32 in the second rank. This would set a flag for the second rank to be sparse. This information would be carried all the way through to code generation where the appropriate structures would be generated and maintained.

Limitations to this idea is that the specific sparse representation cannot be specified. Is this OOC? CSR? CSL? DIA? Who knows. That could be given as a generation option (for example: mlir-opt -sparse-to-ooc). In any case, this representation seems very limiting to me and not the best option.

Add sparse as a base type

In this case, I could see adding a new base type to MLIR for sparse representation:

%0 = alloc : sparse<16x32xf32>

This representation has the potential to be customizable through parameters to the type. Perhaps an ooc or csr flag can be added to the type. Possible limitations are causing instability across MLIR for such a tightly integrated representation. This may be more work than anyone would like to encounter.

Sparse dialect with parametric type

This comes from the tutorial page Defining Attributes and Types. This method would be similar to the second option presented in this post, and would look something like:

%0 = alloc : !Sparse<16x32xf32>

Not as pretty, but more in line with how MLIR operates. As a separate dialect it shouldn’t interfere with the general operation of MLIR when Sparse is not used. I am concerned with interoperability with other dialects, but I believe any problems here would just be handled within the Sparse dialect specifically. This is the option I am leaning towards.

Call for comment

These are my thoughts. Nothing for me is set in stone, so I am excited to get comments, suggestions, opinions, etc. I want what’s best for the infrastructure, so whatever it takes me to add sparse representation in a way that the majority is comfortable with, I’ll do.

Thank you for reading, now - what do you think?


1 Like

Thanks for your posting John! Your question is very timely since we just started exploring similar ideas here at Google. I did my PhD research in automatically converting a program written in a simple dense format to a much more complex, but also more efficient sparse format [1]. This approach has many advantages, for example, users can focus on the algorithm rather than the data structures, one dense program can be mapped to several different sparse storage formats, possibly even tailored for particular nonzero structures, etc. Given that background, I thought taking the same approach in ML on sparse tensors would be worthwhile exploring!

Our initial thoughts were also around annotating dense tensors as sparse somehow, possibly by specifically indicating a particular sparse storage scheme, or leaving that open for the compiler to decide. I started prototyping some lowering ideas, but have not started anything formal yet on how annotation should look. I would love to brainstorm with you some more on this!

[1] A.J.C. Bik. Compiler Support for Sparse Matrix Computations. ISBN 90-9009442-3, NUGI 855, Leiden University, 1996.

Hi all,

Joining the conversation. I too have been working recently on compilers for sparse tensor algebra (fan of your thesis Aart!). If you’re interested the code is at and my thesis at

I think it would be good to build a sparse code generator into MLIR that could automatically generate code. That way you don’t have to implement them all by hand and that way you can port them to new architectures. My group at Stanford is working on code generation techniques for sparse computation, so would also love to brainstorm.

Thank you Aart and Fredrik. I would really appreciate any input and guidance you could provide.

I feel it’s safe to assume that the code generation will be very close to how Taco makes it happen. There’s good work there and I think the important step now is to find a representation in MLIR that will abstract enough of the details of the sparse tensor to make usage fairly seamless to the current tensor representation.

I am concerned that my pursuit of memref is not the correct direction, and that I should be focused on finding a way to implement sparse tensors using - oddly enough - the tensor type. I have been working with the memref type as this is the type that I can actually generate LLVMIR using mlir-opt -convert-std-to-llvm successfully. I am unable to do the same thing with the tensor type, so I’ve avoided doing work there.

If the tensor type is the best to work with, what mlir-opt options can I use to generate LLVMIR. Or, if that’s the wrong way to approach this problem, what suggestions do you have that would produce results? I really want to see the code generation for tensors, but as far as I can tell, this doesn’t happen.

In short, thank you for your willingness to brainstorm. Now, perhaps I can be dragged by my bootstraps to your level so we can talk like civilized engineers. :wink:


Thanks John and Fredrik, nice “emeeting” you both. Are you available for a quick virtual meeting, just so we can get to know each other better and align our objectives (others are welcome to join too)? We should of course continue to post concrete proposals in this forum, but an initial meet-and-greet may be useful. If you are willing to do so, please shoot me an email!

There’s no off-the-shelf solution for converting tensors to memrefs. There are many ways to do it that require taking one opinion or another – part of the reason that MLIR core doesn’t have this yet. Happy to elaborate if there is interest.

Tensors [quote=“jjolly, post:4, topic:1624”]
If the tensor type is the best to work with, what mlir-opt options can I use to generate LLVMIR

Tensor does not have a memory representation in MLIR: going to MLIR requires choosing some in-memory layout and mapping. This is actually why memref is there: it allows to make this choice. There is actually a pass doing buffer allocation to turn computation on tensor into computation on memrefs, I’m not sure what are the restrictions on it though.

I wondered why memref was the only code generating memory representation I could find. I see there are std ops that allow translation to/from tensor and memref, but I could not make them produce anything like the underlying structure of memref ops.

So, Under Construction eh? Seems like a place I’d like to play. :nerd_face:

In all seriousness, I’d be interested to hear the various opinions on the matter of tensors - either converting to memref or even native codegen from tensor types. Are there previous threads on the issue? Perhaps I’m overly simplistic, but wouldn’t dense tensors have a rather simple codegen? I’m not getting into my interest regarding sparse tensors, I just want to see dense be implemented properly first.

Afterward, we can start throwing around TLAs like OOC, CSR, DIA, etc. :wink:


This is similar to why scalar compilers do register allocations. Tensors are SSA values. At some point they have to be deSSA-ed to use something “physical”. That is memref. So there is quite a bit of similarity between register allocation problem and the memory allocation problem that is required to go from tensormemref representation.

That makes complete sense and is exactly how I would expect MLIR to operate. My struggle is to find a way to cause a tensor to be deSSA-ed into a memref structure - even if it’s a straight one-to-one transformation. There does not appear to be an option in mlir-opt that allows for this transformation. In fact, I have yet to find an compiler pass that seems to do any transformation of the tensor type. I’m guessing that I am missing something obvious.

I am comforted by your description with the knowledge that the tensor type is expected to be transformed into a memref through one or more compiler passes. I just need to find the pass that performs the transformation - or find a way to get it written.


As @mehdi_amini pointed out, the infrastructure to do the conversion from tensor to memref exists in MLIR proper but it’s realized out of tree as a full-fledged pass - for eg. in the mhlo → lmhlo → affine → following which MLIR proper can take over. See below if this is what you expected from codegen as one example.

This is your tensor form - it’s obvious what it’s doing. Some of these ops can just as well be defined on the std dialect.

func @func_op_long(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {
  %1 = mhlo.maximum %arg0, %arg1 : tensor<4xf32>
  %2 = mhlo.add %arg0, %1 : tensor<4xf32>
  %3 = mhlo.minimum %arg0, %arg1 : tensor<4xf32>
  %4 = mhlo.subtract %arg1, %3 : tensor<4xf32>
  %5 = mhlo.multiply %2, %4 : tensor<4xf32>
  return %5 : tensor<4xf32>
$ mlir-hlo-opt -hlo-legalize-to-lhlo -lhlo-copy-removal -lhlo-legalize-to-affine --affine-loop-fusion -memref-dataflow-opt tensor_compute.mlir
func @func_op_long(%arg0: memref<4xf32>, %arg1: memref<4xf32>, %arg2: memref<4xf32>) {
    affine.for %arg3 = 0 to 4 {
      %0 = affine.load %arg0[%arg3] : memref<4xf32>
      %1 = affine.load %arg1[%arg3] : memref<4xf32>
      %2 = cmpf "ogt", %0, %1 : f32
      %3 = select %2, %0, %1 : f32
      %4 = affine.load %arg0[%arg3] : memref<4xf32>
      %5 = addf %4, %3 : f32
      %6 = affine.load %arg0[%arg3] : memref<4xf32>
      %7 = affine.load %arg1[%arg3] : memref<4xf32>
      %8 = cmpf "olt", %6, %7 : f32
      %9 = select %8, %6, %7 : f32
      %10 = affine.load %arg1[%arg3] : memref<4xf32>
      %11 = subf %10, %9 : f32
      %12 = mulf %5, %11 : f32 %12, %arg2[%arg3] : memref<4xf32>

You may want to incrementally add each pass in that list to see the output at each stage. -hlo-legalize-to-lhlo is where the type conversion happens. Note that the final output has a single loop nest with a single store.

The issue is not just the conversion of the data type. Like you said, there is a very simple degenerate case there for dense tensors. But converting from tensors to memrefs or other “buffer-like” things also means that you give up immutability and have to start worrying about alias analysis. If you are trying to schedule work on an accelerator using a low-level API like Vulkan, you need to do certain kinds of analyses/transformations that are much easier to do while still operating on tensors. Additionally, in the presence of fusion, one needs to know which tensors in the program will even exist as memory allocations at all (e.g. when fusing multiple elementwise operations, the valuesof some “tensors” really just exist ephemerally intermediate values in registers). We’ve really struggled a lot in IREE (and continue struggling) with reconciling these issues.

We had our initial meet-and-greet meeting on sparse compilation today and, with permission, I post the participants and contact information below. Although we plan to continue our discussion mainly here on discourse, please add your own name and email if you want to be contacted for follow-up meetings in the future!

John Jolly (
Mary Hall (
Mahesh Lakshminarasimhan (
Tuowen Zhao (
Tharindu Rusira (

Michelle Strout (

Fredrik Kjolstad (
Amalee Wilson (

Gokcen Kestor (Pacific Northwest National Laboratory)

Aart Bik (
Rasmus Larsen (
Mahesh Ravishankar (
Nicolas Vasilache (
Penporn Koanantakool (

I’m interested to stay tuned in this, feel free to add me to the invite list. Thanks!

1 Like

I’m also interested in participating as I was planning to look at it from the GPU point of view.

Thomas Raoux (

1 Like

Hi all,

This looks interesting! I’d also be interested in participating. I’m currently a PhD student at MIT and have also been involved in work on the TACO tensor algebra compiler.

Stephen Chou (

1 Like

I think a reasonable design is to have a high-level dialect that describes operations on opaque tensors, independently of whether they are dense or sparse. Each tensor in this dialect then needs a type (its component type and dimension sizes) and a way to describe how its data representation. From this tensor dialect we could compile the tensor expressions directly to low-level MLIR code that operate on buffers using these memrefs.

A good way to describe the storage of a tensor, whether dense or sparse, is to separately describe the data structure of each dimension as well as the ordering of dimensions (e.g., row-major vs column-major). The per-dimension data structures match well to code generation and just a few of them are sufficient to recreate the formats that seems most relevant for sparse machine learning. These are: dense/uncompressed, compressed, single, and hash maps.

From these you could express the standard matrix formats:

  • Dense (dense,dense),
  • CSR (dense,compressed),
  • CSC (dense,compressed with reverse dimension ordering),
  • COO (compressed,single), DCSR (compressed,compressed).

Vectors and tensors fall out naturally by just adding more dimension descriptors: COO four-tensors (compressed,single,single,single), CSF three-tensors (dense,compressed,compressed). Blocked matrices can also be expressed as higher-order tensors. For example the BCSR format that seems popular for sparse ML is expressed as (dense,compressed,dense,dense).

Given tensor descriptors like these and some language of tensor operations, it is possible to build a compiler that generates code to execute the operations on the given data structures. This would work for dense tensor expressions whose tensors only have dense dimension data structure descriptors, or some mix where some tensors are dense and some sparse. The taco compiler can do it, and although other compilers of course are possible, perhaps it would be possible to add a taco backend for low-level MLIR and a front-end for high-level MLIR?

I also think it makes sense to optimize the operations before generating the code that accesses the irregular data structures using standard loop transformations applied to sparse such as split/strip-mine, collapse, reorder, parallelize (and a new called precompute). Then you don’t need to analyse sparse code. The taco system has a scheduling language to express these transformations, and with them you can get good performance on both CPUs and on GPUs. It does not have a custom vectorizer though, so that’s tbd.

1 Like

Thank you, Aart, for organizing the meeting! Please keep me on the invite list also.


1 Like

Hi all,

This seems up my street, I’d also be interested in participating. I’m currently a PhD student at University of Glasgow, and have been investigating sparsity in TVM.

Perry Gibson (

1 Like

Also posting an update I received from Gokcen:

Gokcen Kestor, PNNL,
Ruiqin Tian, PNNL,
Andrew Lumsdaine, PNNL, University of Washington,