[RFC] tensor.pack and tensor.unpack

Hi All,

@MaheshRavishankar @hanchung @rengolin @nicolasvasilache and I would like to propose the introduction of two new operations in the tensor dialect: tensor.pack and tensor.unpack.

pack and unpack return new tensors within which the individual elements are reshuffled according to the packing specification. This has the consequence of modifying the canonical order in which a given operator (i.e., Matmul) accesses the individual elements. After bufferization, this typically translates to increased access locality and cache behavior improvement, e.g., eliminating cache line splitting.

The rationale: Optimized memory formats are needed to obtain high-performance code [1] for different primitives such as convolutions (see [2] section 2) and matmul (see [3] section 4.2.3).
As the input tensor may not be in the optimal memory layout, we provide pack and unpack.

Before moving forward with a PR, we would like to hear your thoughts. Please review our document regarding the operations’ design, lowering, and syntax.

doc: RFC pack/unpack - Google Docs

[1] Reorder — oneDNN v3.0.0 documentation
[2] [1808.05567] Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures
[3] https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf

CC @sanjoyd and @raghavanr we discussed this a bit in the context of TCP with @rengolin and may be interesting for you.

Can you unpack (ha!) what you mean by “mathematical perspective” a bit?

One of your example changes a tensor<128x256xf32> to a tensor<4x16x32x16xf32>; naively I’d think this does change the tensor from a mathematical perspective. For instances, valid indices into the former are not valid indices into the latter.

Sorry, with that I mean: we are not changing the original content of the tensor but just relayout its elements. It is essentially a copy from one tensor to another.

Is “pack” the equivalent of reshape+transpose? (maybe with some expansion for padding also).

The RFC describes the proposed operations, but does not get into the motivation and in particular comparison to alternatives: like is “pack” the right level of abstraction for a single op compared to having multiple ops to reason about this?

Correct.

However, the result of the pack operation is a packed tensor, which has a different layout, but is still the same tensor as before because the iteration space is also warped. So, if you iterate the tensor with the warped index, you get the exact same tensor as the original. Here, by “warped” I mean with an affine map.

In that sense, pack is the function that takes the tensor from the domain D to the image I (via some affine map and padding), so iteration index f(x) needs to be g(x) where g(x) = map(f(x)). And unpack does the reverse, bringing the tensor back into the domain D. These ops need to be bijective for that to work without loss.

As you describe, it should be possible to use existing ops to achieve the same goal, as a combination of pad, reshape and transpose. But getting all of the cases that we need (a generic block&tile) can get quite convoluted or impractical to do with a series of ops.

There are also other issues to consider:

  1. Pad is on tensor dialect while transpose is on Linalg. We originally had the op on linalg, but this changed after talking to IREE folks. I don’t mind either way.
  2. Tensor has extend_shape and reshape, where the former is closer to what we want, but also contract. I’m not sure how to control which dimensions to block with reshape.
  3. In due time, we want to be able to prove that an unpack is the reverse of a pack, and doing so with spread out ops that could be rearranged or simplified can become impossible

Proving bijection is far down in our TODO list, definitely not for the first iteration, but the idea is that it should be possible to do with these two ops.

I wouldn’t want to start moving the chairs around ops and dialects just because of this, at least not before we all agree on the general usage. Both our group and IREE have been using these ops for a while and it’s quite useful for matmul, convolution, etc, so I’d recommend we get them in some initial shape upstream, so that other people can use them too, and see what we really want.

Once it’s clear, then we can change/move around/whatever people feel is best.

2 Likes

One shouldn’t be using “memory formats” and memory-related concepts to describe operations on tensors! As you know, an MLIR tensor is a value and internally a collection of values (with the ability to think of it with a layout using an attribute). What you are doing is creating a new tensor that has the old tensor values (or a subset of it) now laid out differently and with padding optionally.

You are applying these abstractions to eventually achieve packing into and unpacking from buffers, which is a standard HPC transformation to eliminate cache conflict misses, reduce TLB misses, improve prefetching, and enable vectorization in some cases.

Have you instead considered achieving this in the memref world instead of creating parallel abstractions for everything in the tensor space? The lowering for such memref abstractions is natural without the need to invent even more new things. If you already have a memref that you want to re-layout, there isn’t an operation but memref.copy can just be supported with different layout maps on the LHS and RHS and memref normalization will already give you the desired code. (So, memref.pack is nothing but a memref.subview + copy with pad.)

I am a -1 on slowly adding memory re-layout/repacking abstractions on tensors through new ops and names, and creating a parallel universe. It just leads to new named ops names with ugly-looking code around (like in this issue) for what is naturally expressible with memref + affine/scf. It also unfortunately leads to new solutions to problems that didn’t exist in the first place – with new canonicalizations, new analyses, and new transformations to tackle.

In summary, if you are thinking about the layout of data in memory, cache memory locality, and performance related to memory access patterns (which is what per your rationale and the references linked), working on the memref abstractions in MLIR will allow you to model things staying close to their eventual effects. Unfortunately, a whole bunch of stuff has already been incrementally dumped into MLIR upstream (all the tensor slicing and manipulation-related stuff with scf loops around), which IMO is in the wrong direction. I hope this can all be taken out of the MLIR repo at some point.

Correct.

Our first prototype was on linalg (on Tensors), which is destination-passing on Tensors. The main difference here is that all Tensor dialect ops currently have value semantics, not that tensors can’t (or shouldn’t) use destination passing semantics.

Should we also change linalg to not use DPS on tensors?

For the record, I’m not terribly happy to introduce those ops to the tensor dialect either, I’d prefer them to be in linalg, but this did not seem the consensus of our initial collaboration with IREE. However, moving this to memref will cause some pain, because we’ll be moving down a semantics that we need to be up (in linalg, before tiling).

You are right; I should not have mentioned “memory” as we are still in tensor, but I don’t entirely get what is wrong with "creating a new tensor that has the old values … now laid out differently. The documentation on the tensor dialects states: “The tensor dialect is intended to hold core tensor creation and manipulation ops”. So why tensor.pack does not belong to this category?

Yes, I tried, but it is not working for me, as it does not compose well with the rest of the infrastructure. For example, after packing a set of operations, you may want to tile and fuse them, and these optimizations work on value-based IR. This duality (or parallel abstraction) exists because we want value-based IR, as it simplifies optimizations. I think tensor.pack is in line with what we have today in the tensor dialect.

It doesn’t follow that because memref+affine exist, we should not have a complete universe of ops for doing value domain manipulation, which is what this is doing. Doing these kinds of manipulations in the value domain is incredibly useful when doing high level optimizations on whole programs, and it is not mutually exclusive with lower level affine representations.

There are a lot of opinions about what should come out of (or be disaggregated from) MLIR core. I think we should discuss this separately and proactively. I know, for example, that there are strong opinions about moving affine out, but I wouldn’t use this as a justification to block any of your considered enhancements to it. I think there is a safe and productive way to have such discussions, but piecemeal on each isolated change isn’t it, imo.

6 Likes

Having a single op at tensor level has other benefits. It enables tile+fuse for the pack op w/ producers and unpack op w/ consumers, while we can implement tiling interface for pack/unpack ops. It will be tricky if generalizing them into a sequence of operations. E.g., dealing fusion tiling with tensor.expand/collapse_reshape would be hard. We explicitly know that we want to fuse this type of reshape op with other operations while sometimes we just want reshape ops being metadata ops. IMO, having a single pack/unpack op at tensor’s world can enable better analysis and optimization over whole program scope.

Just as a data point, in IREE we do use sequence of other operations to mimic packing behavior… My current understanding is that that is too much of a generalization. For example if you want to fuse the packing code into a single kernel, the fusion to handle the pad + reshape + transpose (which is what pack is) puts you in a more general solution space than pack. I dont think this logic applies in every case, but in this case seems like using a specific operation for pack allows you to handle it better, than generalizing and using a more general fusion approach to fuse them back. So we were independently coming up with similar abstractions to model this behavior…

Sorry @rengolin :slight_smile: . But it has long been a complaint that Linalg has “too much in it” and concepts of Linalg need to be abstracted out. Already a lot of ops (like tensor.insert/extract, tensor.expand_shape/collapse_shape and their memref counterparts) have been moved out of Linalg. Adding pack operation in Linalg seems like going in the opposite direction. tensor operations do seem to be the more logical place for these, especially since the destination style interface also has moved out of Linalg dialect.

1 Like

Heh, it wasn’t a complaint, just a fact. :slight_smile:

The important thing is that we want a pack operation on tensors, to do tiling and fusing afterwards, and in which dialect it lives is less important to me. We can bike-shed that later if people feel inclined. Now, we really need to get those ops upstream, so we can start working together, not in separate.

+1, we really need good abstractions for packing.

We critically rely on this type of transformation in our matrix multiplication library Ruy to run at extremely high performance. My writeup about packing in Ruy is here if anybody wants to dig in.

2 Likes

Great, I strongly support that perspective.

Given the overwhelming support from everyone interested in program transformations on tensor SSA values, can we start sending PRs ?

3 Likes

Just to clarify: whether an abstraction for packing is needed isn’t the issue under discussion (per the thread above). Packing is useful and has been demonstrated to be essential for decades to achieve high performance. Abstractions for packing are also useful. Where you add these abstractions (tensors or memrefs or both) is where the debate is. All arguments I see above are along the lines: because we are already doing tiling and fusion on tensors, we need packing here. These are all memory layout and reordering abstractions, and you’d be naturally digging yourself deeper if you went down adding more memory related abstractions on tensors. For eg., in the future, if someone wanted to pack and place the packed buffer in a specific memory space, wouldn’t you want to encode that memory space right away instead of doing that further down in the pipeline? You’d need a “memory space” for a tensor: memrefs already have it.

You are placing the cart before the horses here. Having DPS on tensors shouldn’t determine where you add packing abstractions.

The arguments you and others are making in favour of adding it on tensor types are all along the lines: we already ended up adding and using fusion/tiling on tensors, and so we want even more of these memory-hierarchy related abstractions/transformations on tensor types. I still see no technical merit in adding these abstractions on tensor types.

This is the part I don’t agree with, and you are making a pretty general argument/statement here as opposed to on the issue in discussion. The value domain is incredibly useful for certain transformations because you are using SSA in its first class form (CSE, folding constants, dead code elimination, etc.) as opposed to “memory via SSA”. The point I’m making is that the pack and unpack abstractions on tensors instead of memrefs places actions at a distance from its effects: pack/unpack are performed to improve memory locality, exploit the memory hierarchy, improve memory access patterns, and use/exploit different memory spaces. You’d be modelling these too early if you did these on tensor types and one would find oneself wanting even more related abstractions in the future.

I dont see why the tensor.pack operation is “a memory related abstraction”. It is literally taking a tensor that is nD, and converting it to a high dimensional tensor by re-organizing the elements of that tensor. You now have a new tensor. In tensor world this has copy semantics. So I dont see any “memory related abstraction” here. The fact that you would also like to do this on memrefs is a separate issue.

That seems like it is mixing concerns. I dont see a harm in having a tensor that you annotate through some encoding that it needs to go into shared memory. The only thing you need to account for is during bufferization reduce the amount of memory used in practice. memref and tensor have different function. Eventually everything has to be mapped to memref cause you need explicit loads/stores, but staying in tensor land avoids having to do complex alias analysis to ensure that your transformations are correct. (I know cause I implemented tile and fuse using Linalg ops on memrefs and it was a nightmare. I am happy I removed it link).

To echo Stella’s point from earlier: The tensor based code-generation approach is one way of generating code (one that has been used effectively by downstream users like IREE), but is not the only way to generate code. There is no harm in having multiple tools in the toolbox.

I dont understand the actions at a distance part. There is no action at a distance with pack and unpack operations. In any case, this is being built up and used in IREE and will be tested on real models. I’d be happy to share our findings at that point.

3 Likes

Destination-passing style + tensors + predictable in-place row-major bufferization is kind of hacky but basically gives you ability to reason about buffers while still retaining use-def chains and value-semantics. It is startlingly effective. The action at a distance is intentional I think – basically we want to be dealing with a “shadow buffer” under the hood of the tensor, but do so through the value-semantic shackles that retain the analyzability.

We don’t have to avoid admitting it – we are using tensors as a “buffer abstraction” or “memory abstraction” – when we call tensor.pack/unpack we do so with an understanding that the final bufferized code will have memory laid out in a certain way. That is fine and beautiful, since we still obey the semantics of the tensor type. We are not requiring a particular lowering, but we have built a bunch of infra that gives the intended performance when the tensor type gets lowered (bufferized) in a certain way.

This approach has distinct advantages and tradeoffs compared to memref. And considering that we actually started from memrefs in linalg and moved most of our infra to tensors, the evidence is that it works better for this part of the project than memrefs.

2 Likes

If you believe the merit in these choices for code generation, you are free to explore it while keeping these ops in a downstream repo. If there is a dependence issue, it’s a good thing to pull out the remaining stuff instead of dumping them upstream first and then figuring out where to move them or move them out.

The argument that: “it’s used by IREE and that IREE has users, and so this infrastructure has to be part of MLIR core and live upstream” isn’t a great one. I see this argument being made multiple times. There have been several situations where users submitted evolved stuff from their downstream repos upstream or kept things downstream tentatively: I’m not sure why an inverse approach is being chosen here – that of dumping it upstream first and then later thinking of whether it has to come out and iterating in-tree. The bar has to be consistent and uniform for everyone.

On a more general note (and this has been echoed multiple times), I feel specific approaches and strategies to do code generation should be outside the core infrastructure. The need for tensor.pack/unpack ensues from your choice to perform or realize locality transformations on the tensor abstraction: it looks like you can accomplish this by having a downstream component (dialect/repo) for IREE. I see MLIR increasingly as just the core infrastructure + dialects + utilities providing various abstractions: an approach to do memory-style transformations on tensors would lead to not just ops, but a large number of transformation utilities added to the core repo just for that purpose. I strongly feel it has to be kept outside. And the same argument applies not just to the pack/unpack tensor ops but all the tiling/fusion stuff on tensors that should come out I feel.

There are multiple approaches to compilation and code generation, and I acknowledge there are different viewpoints as to the winning approach if there is one. But personally, I feel such abstractions on tensors lead to an ugly mess and are only inventing problems that didn’t exist in the first place with more suitable abstractions. I also understand that a viewpoint like this shouldn’t impact the decision to include these. Instead, the key question is: is it really core/central to MLIR?

Could you consider iterating downstream on these abstractions?

Sure. But could you wait until then and compare it with a competing approach? That would allow you to make a stronger case.

1 Like