Sparse Representation

Thank you Aart, that means a lot coming from you. I read your thesis carefully as well when we were forming the ideas, and learned a lot.

I look forward to working with you too. In particular, it would be fantastic if we could use the current taco code base as a starting point for building MLIR sparse lowering machinery. There are some issues with the codebase that we are trying to work out (happy to describe them), but I think it would be a good starting point. We are happy to gradually yield our control of the source code to a broader community (one that also includes us) if that helps. What do you think of this path?

I think if we put our resources together we could get a good jump on sparse MLIR compilation. In particular, we’re very interested in using requirements from the MLIR workloads as a jumping-off point for student research projects. These projects could at first be done on forks and be integrated only when the broader community is comfortable with them.

Hi,

I’d like to be added to the invite list too, please! I’m a CS grad student at Stanford.

Thanks,
Matthew (mattlkf@cs.stanford.edu)

Let’s try to go back to the original question on how to annotate tensors as sparse in MLIR. I would like to propose a few extra requirements.

(1) If possible, the meaning of the sparsity annotations should be fully transparent to the existing framework. So, although the extra information may have to be propagated during rewriting and lowering, passes are free to ignore the information. If the future “sparse compiler” component in MLIR is not activated at all, tensors with sparse annotations simply lower to dense tensor code, using the paths that exist today.

(2) The actual “sparse compiler” that interprets the annotations and generates code that operates on sparse storage schemes should probably run rather late in the progressive lowering, just so that most optimizations on the tensor index kernels are shared between the dense and sparse paths.

(3) To facilitate progressive lowering, it probably makes sense to support the annotation on the tensor, memref, and perhaps even vector types.

(4) The per-dimension annotation proposed by Fredrik makes a lot of sense, since it encapsulates all common sparse storage schemes for matrices and generalizes nicely to tensors. It does not directly support very specific schemes tailored to particular nonzero structures (e.g. band, diagonals, jagged diagonals), but perhaps these are less important in the ML context; also the compiler can probably still select these for particular per-dimension annotation situations.

Given these requirements, something like an extra attribute

tensor<100x100xf32> { sparsity = ["dense", "compressed"] }

would make sense, where the sparsity list has as many components as there are dimensions in the tensor, although this construct obviously does not work out of the box in MLIR. There are several details to work out to add such support. For example, requirement (1) implies that a value of the annotated type above can be assigned to something of the unannotated type tensor<100x100xf32>, i.e. the annotation does not change the uniqueness of the type. Only the “sparse compiler” should test for type consistency with respect to the annotations.
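Purely as an illustration of the per-dimension idea (the syntax below is hypothetical and, as noted, not accepted by MLIR today), the common matrix formats would fall out along these lines; a CSC-like layout would additionally need a dimension ordering, which comes up later in this thread:

tensor<100x100xf32> { sparsity = ["dense", "dense"] }            // ordinary dense storage
tensor<100x100xf32> { sparsity = ["dense", "compressed"] }       // CSR-like: rows dense, columns compressed per row
tensor<100x100xf32> { sparsity = ["compressed", "compressed"] }  // DCSR-like: both dimensions compressed
tensor<10x20x30xf32> { sparsity = ["dense", "compressed", "compressed"] }  // a 3-D example: dense outer dimension, compressed inner dimensions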

Please let me know if you think these ideas make sense.

Hi Aart,

Previously, we discussed having a layout string attached to the tensor type, something along the lines of tensor<?x?x?x?xf32, "NCHW">. Wouldn’t this work just as well for sparse formats?

My concern here is that there are a lot of different sparse formats and the details matter. Modeling that with an open representation like a string seems like a good way to go.

-Chris

Absolutely! (provided we can propagate such a layout string to memrefs and provided that this does not mess up type uniqueness along the non-sparse path)

Note that the per-dimension annotation of sparse/compressed/… is already able to define quite a rich set of sparse formats (see Fredrik’s earlier post), and we could encode this just as easily in a string (e.g. “S=[DC]”) as in the attribute I suggested above. Having a general string will also allow us to hint at more esoteric sparse storage schemes (like jagged diagonals) in the future.
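Purely as an illustration (the string encoding here is made up, just to show the direction), the same per-dimension information could ride along in the layout-string slot that Chris mentioned:

tensor<100x100xf32, "S=[DD]">   // dense in both dimensions
tensor<100x100xf32, "S=[DC]">   // dense outer, compressed inner (CSR-like)
tensor<100x100xf32, "S=[CC]">   // both dimensions compressed (DCSR-like)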

Thanks for the pointer. Is there a previous proposal I should look at?

A proposal to add layout to Tensor: slides - recording, from an ODM meeting on it (it is old by now).

This is tricky though; requirement (1) above already reads as almost self-contradictory: you can’t really ignore information you have to propagate.
We can add such an attribute on all the types you mention, but I’m not sure we can do that without making it more of a first-class concept: i.e. it can’t just be opaque metadata that we ignore. For example, to be safe, it isn’t clear to me that canonicalization rewrites could touch operations that involve such an annotated type.

You are right, but I wanted to make a more holistic point by contrasting an annotation with a more explicit approach, such as a dedicated new sparse type. In the explicit case, all passes that encounter this new sparse type would immediately break, since they don’t know what to do with it. An annotation on a tensor, on the other hand, would allow these passes to simply continue with dense tensor code generation by ignoring or even dropping the annotation.

But ultimately even an annotation carries semantics, and we indeed need to teach all paths in MLIR that can lead to the new “sparse compiler” how to properly propagate the annotation, or else the mechanism will break.

Does this sound more reasonable to you?

Yes.
As we discussed this morning, I’m also interested in hearing some of the alternatives: in particular, other graph optimizers (TensorFlow and XLA, for example) model this by annotating the layout with an attribute on the operation producing the tensor instead of on the tensor type itself.

Sure, this is also reasonable. The challenges with this that I see are:

  1. you have to standardize the attribute name (bad if you want multiple dialects, e.g. for different ML frameworks, to line up), or define an operation interface that retrieves this.
  2. Attaching it as an attribute means you have to special-case bb/function arguments and results.

Encoding this information in the type system avoids these problems, and provides a consistent way to do this. It also allows things like the XLA layout optimizer to have a more consistent representation to analyze, set and update.
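As a small illustration of point 2 (the syntax on both sides is hypothetical, reusing the string encoding from above): a function or block argument has no producing operation to attach an attribute to, whereas a layout carried in the type simply travels with the value.

// Layout in the type: a function argument carries it directly.
func @kernel(%arg0: tensor<100x100xf32, "S=[DC]">, %arg1: tensor<100x100xf32>) { ... }

// Layout as an attribute on the producing op: fine for op results ...
%0 = "mydialect.produce"() { layout = "S=[DC]" } : () -> tensor<100x100xf32>
// ... but %arg0 above has no producing op, so arguments and results need a separate mechanism.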

When lowering (e.g. in a code generator), I would recommend that the lowering code reject unknown layouts with an error instead of silently swallowing them; other approaches are possible, of course. The reason is that you want to detect bugs early.

I think I found a compromise that allows me to start working on a simple prototype without committing to wide-ranging changes in the type system yet. Since the Linalg dialect will act more or less as tensor index notation, it is straightforward to just add an extra per-dimension map to the trait for linalg.generic.

For example, a kernel C_ij = A_ij + B_ij with A sparse (in the second dimension, to get CSR storage) would be expressed as follows (here sparse_map is new, D = dense, S = sparse). This approach also has the advantage that the annotation is currently ignored by all other passes; only the “sparse compiler” will verify the annotations for consistency and lower the kernel to sparse code when requested.

#trait_matadd = {
  indexing_maps = [
    affine_map<(i,j) -> (i,j)>,  // A
    affine_map<(i,j) -> (i,j)>,  // B
    affine_map<(i,j) -> (i,j)>   // C
  ],
  sparse_map = [
    "DS",  // A
    "DD",  // B
    "DD"   // C
  ],
  iterator_types = ["parallel", "parallel"],
  doc = "C(i,j) = A(i,j) + B(i,j)"
}

func @generic_op_matadd(%arga: tensor<32x32xf32>, %argb: tensor<32x32xf32>) -> tensor<32x32xf32> {
  %0 = linalg.generic #trait_matadd
      ins(%arga, %argb : tensor<32x32xf32>, tensor<32x32xf32>) {
    ^bb(%a: f32, %b: f32):
      %sum = addf %a, %b : f32
      linalg.yield %sum : f32
  } -> tensor<32x32xf32>
  return %0 : tensor<32x32xf32>
}

Thanks Aart.

This does match the fact that Linalg is easily extensible to more advanced use-cases with extra levels of indirection and is a key direction to investigate.

I like the fact that we can turn the lights on with incremental steps, without immediately having to go deep into implementing the LLVM data structure abstraction (in particular, interactions with the vector dialect will be key to longer-term success). Instead we can take advantage of the interplay with the external library you started developing for reading from the Matrix Market.

That being said, even for the purpose of just turning the lights on, a few things need to happen. Here is a potential list that gets us to an end-to-end executable state:

  1. extend LinalgStructuredOpInterface to add support for this attribute; we also need a bunch of patterns to fail to apply when this attribute is present: I don’t think we can ever just ignore it.
  2. if we start in the tensor world, we have the opportunity to control the whole end-to-end story + allocation ourselves. This is interesting because we can fake it for a while with special linalg.sparse_alloc/sparse_dealloc ops that a new pattern converts to std.call, offloading to external C/C++ to do something useful (see the sketch after this list).
  3. these ops should declare the proper interfaces and will play nicely with Linalg buffer allocation, which we will have to extend a tiny bit.
  4. we can connect the dots using rank-erased tensors and buffers for now. This will likely require a few minor extensions to some Linalg pieces to allow us to say “special semantics, keep out” to the rest of the infra.
  5. underneath the rank-erased abstraction we can implement whatever format we want to start from directly in C++, or just link in some existing sparse library.
  6. in a first prototype, we’ll likely need special linalg.sparse_load/linalg.sparse_store ops that take a rank-erased buffer plus some extra information, call into external C/C++ to get the data, and e.g. return some zero padding value if the sparse structure has no entry.
  7. -convert-linalg-to-loops should be extended to generate special control flow and these new ops. This should be relatively straightforward for generic. The special control flow I have in mind avoids needing new scf.for/scf.while constructs just to turn on the lights; it pseudo-resembles:
for %i, %j ... 
  %val = linalg.sparse_load(%A, %i, %j, %pad_val)
  %cond = cmp "!=", %val, %pad_val
  if (%cond) {
    ...
    store
  }
  8. having the loop bounds properly derived in point 7 may require a bit of genuflection and a special linalg.sparse_dim op that again goes to the library and gets the proper value. Pretty quickly, this may need to take outer loop induction variables.
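To make point 2 above a bit more concrete, here is a minimal sketch of the offload idea; every op and function name below is made up purely for illustration:

// Before: special ops own the sparse storage behind a rank-erased buffer.
%A = linalg.sparse_alloc(%d0, %d1) : memref<*xf32>
// ... sparse loads/stores on %A ...
linalg.sparse_dealloc %A : memref<*xf32>

// After a conversion pattern to std.call: plain calls into an external C/C++ runtime.
%A = call @sparse_alloc_csr(%d0, %d1) : (index, index) -> memref<*xf32>
// ... library calls on %A ...
call @sparse_dealloc(%A) : (memref<*xf32>) -> ()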

Now the list above is but one possibility; other paths may also make sense, so please feel free to adapt and farm out work. Still, I believe that being able to interop with external libraries + hide things from MLIR and then gradually expose more things to MLIR in a bottom-up fashion has many advantages. It has certainly proven beneficial in the past.

Once the lights are on, we can start pulling in the sparse + TACO tricks, a proper MLIR sparse type representation, and new scf.for/scf.while as needed to factor out control-flow overhead and connect to vectors. We can incrementally retire the things that are subsumed, without breaking library interop.

I won’t start making a laundry list of things that will be involved there and that you are likely to post soon anyway :slight_smile: .

Regarding the sparse_map encoding above, I’ll just pretend you wrote:

sparse_map = [
    ["D", "S"],  // A
    ["D", "D"],  // B
    ["D", "D"]   // C
  ],

:slight_smile:.
It’s just a matter of using ArrayAttr::getAsValueRange to write our interfaces rather than iterating over StringRef and parsing it manually. Then the day we want something else, we can just change the type easily.

I’m really excited about these developments and looking forward to the transformation steps.

Thanks for pushing this angle!

But of course! :wink:

It seems reasonable to me that the annotation be added to the ops first. I guess for now it is “expected” that the operands are sparse in the “right” way, i.e. there is no check to ensure the tensors match up?

My main question is: would this eventually make it to the type itself, i.e. the tensor type would carry extra information that effectively represents the same thing as the sparse_map, making this attribute unnecessary?

The “sparse compiler” part will verify the consistency of the annotation. If a tensor appears twice in a kernel, for example, the annotation should be consistent; otherwise it is rejected. We probably also want to add an affine map to specify the index order on each tensor (cf. TACO’s way of defining formats, which requires (1) a per-dimension annotation and (2) a dimension order for each tensor).
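For example (the dim_order name and syntax here are made up, just to show the direction), CSC storage for A could then be requested by pairing the per-dimension annotation with a dimension-order map in the trait:

sparse_map = [
  "DS",   // A: outer dimension dense, inner dimension compressed
  ...
],
dim_order = [
  affine_map<(i,j) -> (j,i)>,  // A: columns become the outer dimension, i.e. CSC rather than CSR
  ...
]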

Yes, eventually we need to design a proper type system which would obsolete (or at least imply) the annotation I propose above. The current solution is pragmatic: it allows me to start prototyping this idea starting from a Linalg expression. Eventually, all ends need to be connected to get an end-to-end solution working.

Dr. Gokcen Kestor will give a presentation titled “Code generation for sparse computational kernels in MLIR” on Dec 2, 2020, at 10am. Everyone who signed up by email will receive a Google Meet invite with more information. I personally look forward to this talk!

I have been following this discussion through my PhD student John Jolly, but I wanted to take a moment to personally express interest in participating in this design and development effort on behalf of my research group and collaborators Michelle Strout (Univ. of Arizona) and Cathie Olschanowsky (Boise State). We have been working on sparse polyhedral optimization and code generation for some time (see, e.g., papers in CGO’14, PLDI’15, SC’16, PLDI’19), and we view MLIR as an opportunity to deploy support for sparse representations in a widely used framework and to leverage other parts of the system.

The idea proposed by @aartbik of delaying binding of the sparse format mirrors what we are thinking. We are also interested in supporting the following:

  • Composition of the data representation with the schedule
  • Parallelizing computations with data dependences (beyond reductions), such that runtime information is needed to finalize the schedule (e.g., sparse triangular solve)
  • How architecture might impact data representation and schedule, and abstractions that defer lowering to architecture-specific code
  • Data layout and value optimizations (e.g., selectively zeroing out values)

A recent talk at the IA^3 workshop has more details: https://hpc.pnl.gov/IA3/

As a PI on the DOE Exascale Computing Project, I am also interested in the role MLIR might have in building a software ecosystem for the HPC community. This is already happening to some extent with LLVM, but certain kinds of optimizations are difficult to do at lower levels of abstraction.

For now, we are monitoring what is happening on this thread and the design thread, and three of my students, including John Jolly, are testing out approaches for some of the objectives described above.

Thank you very much, Prof. Hall, for your reply and the background information. I should indeed have made it more clear in this thread that I also spun off another thread specifically for the MLIR work.

I look forward to the results of your students, and perhaps even some upstream contributions. Please let us know when you are ready to present something, either specifically to this sparse audience, or more generally in one of our Open Design Meetings.

Welcome here @mhall :slight_smile:

It’s great to read this! This is really in line with the idea I was pushing in my keynote at the LLVM-HPC workshop last week (temporary links to the slides and recording, FYI; in particular the last 15 min, starting at slide 41 or 50 min into the recording). I’m really looking forward to this!

Many thanks to Dr. Kestor for giving a very interesting talk on “Code generation for sparse computational kernels in MLIR” (and thanks to our PM Pankaj for quickly making the recording publicly available).
