Improving handling of unit dimensions in the vector dialect

This is a continuation of the discussion in https://github.com/llvm/llvm-project/pull/72105. One outcome of that discussion is noticing that the semantics of a number of vector ops converge when manipulating unit dimensions (vector.shape_cast, vector.broadcast, vector.extract/insert, vector.extract_element/insert_element and vector.transpose), and cleaning them up relies on “getting lucky” with a slew of canonicalizations/patterns. This leads to difficulties in picking canonical forms and to special-casing in vector-related patterns.
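To make the overlap concrete, here is a hedged illustration using today’s ops (different patterns can produce any of these forms for the same logical operation; %dest is just some pre-existing vector<1x4xf32>):

vector.broadcast %v : vector<4xf32> to vector<1x4xf32>
// ==
vector.shape_cast %v : vector<4xf32> to vector<1x4xf32>
// ==
vector.insert %v, %dest [0] : vector<4xf32> into vector<1x4xf32>

A pattern that wants to recognize “prepend a unit dim” has to match all three.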

Idea #1

One approach could be to introduce vector.expand_shape/vector.collapse_shape with semantics similar to tensor.expand_shape/tensor.collapse_shape. The observation here is that, outside of vector.extract_element/insert_element, all of the above ops devolve to reshapes when the only affected dimensions are unit dimensions. This could unify a few of the representations above without having to sacrifice all “loop structure” the way shape_cast does.

vector.broadcast ... : vector<4xf32> to vector<1x4xf32>
// ==
vector.expand_shape [[0, 1]] ... : vector<4xf32> to vector<1x4xf32>
vector.extract %0[0, 0] : vector<4xf32> from vector<1x1x4xf32>
// ==
vector.collapse_shape [[0, 1, 2]] ... : vector<1x1x4xf32> to vector<4xf32>
vector.transpose [1, 0] ... : vector<1x4xf32> to vector<4x1xf32>
// ==
%1 = vector.collapse_shape [[0, 1]] ... : vector<1x4xf32> to vector<4xf32>
vector.expand_shape [[0, 1]] ... : vector<4xf32> to vector<4x1xf32>

A concern with this representation is the use of two ops to represent a single transpose; it doesn’t really look more canonical than before to me, which brings me to the second idea.

Idea #2

Add collapse and expand indices to vector.shape_cast. The observation here is that any non-scalable vector reshape can be represented with a full shape collapse + shape expand.

vector.shape_cast ... : vector<AxBxf32> to vector<BxAxf32>
// ==
%1 = vector.collapse_shape [[0, 1]] ... : vector<AxBxf32> to vector<(A*B)xf32>
vector.expand_shape [[0, 1]] ... : vector<(A*B)xf32> to vector<BxAxf32>

Default shape_cast semantics can keep the full collapse → full expand and elide the reshape indices, making this a relatively non-intrusive change. The benefit of explicitly representing the collapse/expand indices is that analyzing whether a shape cast is just a collapse or an expand becomes significantly easier, and it also enables a relatively trivial pattern that rewrites any shape cast as an explicit collapse/expand. Then we can do analysis on the expand/collapse portions in isolation (or even try propagating them through the IR in different directions!).
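As a hedged sketch of what the syntax could look like (the collapse/expand attribute names are placeholders, not an actual proposal for spelling):

// Default form: full collapse → full expand, indices elided.
vector.shape_cast ... : vector<2x3x4xf32> to vector<6x4xf32>
// ==
// Explicit form: the expand side is the identity, so this cast is a pure collapse.
vector.shape_cast collapse = [[0, 1], [2]], expand = [[0], [1]] ... : vector<2x3x4xf32> to vector<6x4xf32>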

I’m not familiar enough with scalable vector semantics to know that this is completely correct, but arbitrary shape casts don’t seem to play very nicely with scalable vectors, e.g.

shape_cast ... : vector<[1]x[1]x[1]xf32> to vector<[1]x[1]xf32>

There is ambiguity to me in how the shapes are being reassociated in this case (maybe this is just illegal), but having explicit collapse/expand indices seems to clarify the semantics here (my high-level understanding of scalable vectors is that they are similar to dynamic shapes in tensors, but with more restrictions).

shape_cast collapse = [[0, 1], [2]] ... : vector<[1]x[1]x[1]xf32> to vector<[1]x[1]xf32>

Additionally, if we made this addition, vectorization of static tensor.collapse_shape and tensor.expand_shape would be able to preserve the reshape structure of those ops through vectorization without having to go straight to shape_cast (if such a pattern even makes sense).
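For instance (hypothetical vectorization output, assuming the reassociation indices carry over unchanged):

%0 = tensor.collapse_shape %t [[0, 1], [2]] : tensor<2x3x4xf32> into tensor<6x4xf32>
// could vectorize to
vector.collapse_shape [[0, 1], [2]] ... : vector<2x3x4xf32> to vector<6x4xf32>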

Thoughts?

The above are just a couple of quick ideas I’m coming up with in the wake of the discussion on the PR. There is definitely more to flesh out, and we need to be sure that this really handles unit vector dims the way we want. More/other ideas are welcome.

@mehdi_amini @dcaballe @MaheshRavishankar @antiagainst @nicolasvasilache @banach-space @c-rhodes


This is just illegal; there’s no such thing as a unit scalable dimension, really. [1] is 1 x runtime_constant. So vector<[1]x[1]x[1]xf32> contains runtime_constant cubed elements, and vector<[1]x[1]xf32> contains runtime_constant squared elements. These can only be the same number of elements if runtime_constant happens to be 1.
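To put numbers on it: if the runtime constant (%vscale) happens to be 2, vector<[1]x[1]x[1]xf32> holds 2^3 = 8 elements while vector<[1]x[1]xf32> holds 2^2 = 4, so no element-preserving cast between the two can exist.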

Fly on the wall: generally, I’ve found that the expand/collapse style (or unit-dim-only squeeze/unsqueeze) is a very useful form for the cases it represents. A general shape cast or reshape is often reached for because it has the desirable property of being a non-data-movement op, but that typically comes with information loss that makes later analysis and transformation hard.

Someone more knowledgeable of these specific transforms than me would need to comment on which form would be better, but I instinctively shy away from making shape_cast more complicated.

Thanks for clarifying. We can ignore that part of my suggestion then. I’ll let you or others comment on whether scalable vectors could benefit from a stricter representation of reshapes.

I like the idea of squeeze/unsqueeze. It sounds like a very explicit way to address the observation that the vector ops listed above converge only for unit dims.
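A hedged sketch of what that might look like (op names and syntax hypothetical): squeeze/unsqueeze would only ever drop or insert unit dims, never regroup non-unit ones.

vector.squeeze ... : vector<1x4x1xf32> to vector<4xf32>
vector.unsqueeze ... : vector<4xf32> to vector<1x1x4xf32>

Ambiguity about which unit dims are affected could be resolved by the result type, or by an explicit list of dims.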

Thanks for pushing on the execution side of the discussion :slight_smile:. I think option #1 would be a great improvement! Aligning vector and tensor terminology where they overlap sounds like the right thing to do, and both the simplification of vector.shape_cast and the proposed canonicalizations sound great to me. IIUC, with option #1 we would end up with vector.collapse_shape, vector.expand_shape and vector.shape_cast, where the latter would only handle cases where both the source and the destination shapes have the same rank?

Not a big deal, IMO. We should add a canonicalization pattern to make sure we preserve the collapse + expand op order (if that’s the order we want).

I would suggest that you evaluate the impact and amount of work needed for this change and perhaps share an implementation plan so that we can provide feedback. My experience is that these changes may involve far more work and pain than anticipated (see my attempt to remove vector.extract_element/vector.insert_element here; I’ve been stuck in step #2 for a few months already), as there might be quite a few patterns that need to be migrated.

Thanks for proposing this!

Note that the following would be ambiguous for scalable vectors:

vector.expand_shape [[0, 1]] ... : vector<[4]xf32> to <result>

<result> could either be (scalable “1”):

  • vector<[1]x4xf32>

or (scalable “4”):

  • vector<1x[4]xf32>.

It’s not obvious to me what should happen here and in general we might just discover that certain cases make no sense for scalable vectors. In this particular case we could do this:

  • vector.expand_shape [[[0], 1]] ... : vector<[4]xf32> to vector<[1]x4xf32>,
  • vector.expand_shape [[0, [1]]] ... : vector<[4]xf32> to vector<1x[4]xf32>.

Alternatively, one could allow “scalability” to be attached to any dimension if there’s only 1 scalable dimension. That would be semantically correct, but to generate “good” scalable code we’d still need to make sure that the “right” dim is scalable.

In practice, it shouldn’t really matter whether we have:

  • vector<[1]x4xf32> or vector<1x[4]xf32>

if the other dimension is “1”. However, your proposal applies to more general cases too (rather than just to “2d vectors with one unit dimension”).

-Andrzej

Scoping out the work sounds good. If we go with option #1, my main question would be what to do with shape_cast in the long term.

IIUC, with option #1 we would end up with vector.collapse_shape, vector.expand_shape and vector.shape_cast, where the latter would only handle cases where both the source and the destination shapes have the same rank?

I think we have a few options.

  1. Leave shape_cast as it is, similar to how tensor has a general reshape alongside the more structured collapse/expand.
  2. Restrict it in some way like you’re suggesting. I might need to see an example of what you’re trying to represent with a same-rank-only shape_cast (a guess at one follows this list).
  3. Remove shape_cast altogether in favor of collapse + expand. This depends on the use cases of the current consumers of shape_cast.
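For option 2, here is a hypothetical example of a same-rank cast that is neither a pure collapse nor a pure expand (my guess at what the restricted op would still have to cover):

vector.shape_cast ... : vector<2x6xf32> to vector<4x3xf32>

If casts like this stay in scope, the same-rank restriction doesn’t seem to buy much structure over the current op.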

I can try to give a more detailed analysis of the benefits of each after scoping out the work.

Thanks for proposing this!

Note that the following would be ambiguous for scalable vectors:

vector.expand_shape [[0, 1]] ... : vector<[4]xf32> to <result>

<result> could either be vector<[1]x4xf32> (scalable “1”) or vector<1x[4]xf32> (scalable “4”). […] In this particular case we could do this:

  • vector.expand_shape [[[0], 1]] ... : vector<[4]xf32> to vector<[1]x4xf32>,
  • vector.expand_shape [[0, [1]]] ... : vector<[4]xf32> to vector<1x[4]xf32>.

I might be misunderstanding details about scalable vectors, but would the result type alone not be enough information? e.g.

vector.expand_shape [[0, 1]] ... : vector<[4]xf32> to vector<[1]x4xf32>

or

vector.expand_shape [[0, 1]] ... : vector<[4]xf32> to vector<[2]x2xf32>

Based on the clarification from @MacDue, it seems like there are two requirements for any vector reshape:

  1. The product of all of the vector sizes (coefficients for scalable vectors) is the same before and after the reshape.
  2. The number of scalable vector sizes stays the same.

This way, any reshape is just a reinterpretation of the way the elements are laid out within the vector, i.e. there is no data movement. With those two conditions alone, we would still allow reshapes like this, though:

vector.collapse_shape [[0, 1], [2]] ... : vector<[4]x[2]x6xf32> to vector<[8]x[6]xf32>

or

vector.collapse_shape [[0], [1], [2]] ... : vector<[4]x[2]x6xf32> to vector<[4]x2x[6]xf32>

So the op also allows reassociating scalability in the vector type, unless we require that the number of scalable dims within each collapsed/expanded group is invariant. Again, I am probably missing details on whether this kind of “free-form” reassociation of scalability makes sense. If there are any resources for reading up on the restrictions/semantics of scalable vectors, I’d love to take a look.

I would think so too: that is, we can’t do “type inference” on this operation; the return type must be provided by the user. I suspect @banach-space was maybe trying to provide enough information for the return type to be fully inferable here?

Coming late to the party here. I think some misconceptions have surfaced in the wall of text that was posted on the original PR, to the point that we should push the reset button and restart the discussion from first principles in an ODM. I started replying to the first few comments on that thread but stopped for now and am porting the discussion here until we synchronize with higher bandwidth:

Conversion to vector.shape_cast is a more efficient lowering for LLVM since LLVM supports multi-dimensional vectors. This doesn’t work for SPIR-V since it does not support multi-dimensional vectors.

No, LLVM does not support multi-dimensional vectors. I am unclear what requirements SPIR-V has that differ fundamentally from LLVM where vectors are concerned.

  • I think these lowerings were introduced for practical reasons (if I hit an implementation gap I just lower to inefficient code instead of crashing). Nicolas, correct me if I’m wrong.

Spot on, the objective has always been for these shape casts to disappear by folding, but in practice people were hitting NYI cases and it was better to allow lowering through an inefficient vector.shuffle via 1-D. Maybe we should make that a debug option and not activate it by default, but I suspect many things will break for flows that depend on the vector.shape_cast lowering.
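For reference, a hedged sketch of the shuffle-via-1-D route on a tiny case (illustrative only; the actual lowering patterns may emit different ops):

%r0 = vector.extract %a[0] : vector<2xf32> from vector<2x2xf32>
%r1 = vector.extract %a[1] : vector<2xf32> from vector<2x2xf32>
// Concatenate the rows; mask indices select from the pair (%r0, %r1).
%flat = vector.shuffle %r0, %r1 [0, 1, 2, 3] : vector<2xf32>, vector<2xf32>
// %flat : vector<4xf32>, the linearized form of the 2x2 source.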

I don’t think vector.transpose should be a canonical representation of 1xN → Nx1. There is no data movement here so using a vector.transpose op would be misleading.

The intended semantics of vector.shape_cast is indeed that it should fold away into other operations, all the way to memory addresses, and that the relative order of individual elements should be preserved. This is why we didn’t have lowerings of vector.shape_cast in the beginning. However, there is a different layout of information in vectors when writing

%1 = vector.shape_cast %0 : vector<1x13xf32> to vector<13x1xf32>

Assuming that we are on a HW with 128b vectors, in the absence of other optimizations:

  • %0 is expected to lower to 3 instances of <4 x f32> and 1 instance of <1 x f32> (LLVM is very robust at performing all these splits to the actual HW register size)
  • %1 is expected to lower to 13 instances of <1 x f32> (LLVM is also quite good at recombining things afterwards so YMMV).
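A hedged picture of the register assignment in that scenario (assuming 4 f32 elements per 128b register and no other optimizations):

// %0 : vector<1x13xf32>  ->  [e0..e3] [e4..e7] [e8..e11] [e12]   (3 x <4 x f32> + 1 x <1 x f32>)
// %1 : vector<13x1xf32>  ->  [e0] [e1] ... [e12]                 (13 x <1 x f32>)

Same element order, but the elements end up in differently shaped registers, hence the shuffles.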

In any case, I believe it would be good to (re)read the deeper dive explanations around 'vector' Dialect - MLIR.

vector.shape_cast’s docs went stale as we iterated; I’ll send an update to describe the current semantics and we could use that as a starting point. A few things evolved from practice since the early days; originally, we even had descriptions involving tuples of vectors that are long gone. As a rule of thumb, thinking of it as linearizing to 1-D + delinearizing from 1-D is an accurate description in my view.

Regarding the current proposal, here are some meta-points that I would like to discuss in a higher-BW setting.

Re Idea #1, I appreciate the attempt to use the same semantics as tensor/memref and reuse familiar syntax that has been more battle-tested, but there are fundamental differences in practice:

  • in the tensor case, we have a virtual representation that has not yet been materialized, it may or may not lower to an abstraction that moves data.
  • in the memref case, we require that there is no data movement (i.e. an explicit copy must have been inserted if data movement is required)
  • in the vector case, while the order of the data does not change by linearization + delinearization, the location of the data inside of multiple vectors is fundamentally different (see my explanation above re. vector<1x13xf32>)

Additionally, I believe the reassociation information is unnecessary; in the case of tensor/memref it only serves to disambiguate in the presence of ? (i.e. does ?x?x? -> ?x? mean ?x?x? -> (?x?)x? or ?x?x? -> ?x(?x?)). I don’t think we can have such ambiguities with constants only. But I also understand some folks are working towards ? in vector types, so YMMV…

Re idea #2, yes, the semantics is that the order of individual elements is preserved, and one should always think of a shape cast as a linearize/delinearize step. However, data may technically “move” within the vectors, in isolation of the rest, as I explained above.

Re:

This seems incorrect to me: one should think of a scalable [1] dimension as 1 * %vscale. As a consequence the expression is not homogeneous: you have %vscale^3 on the LHS and %vscale^2 on the RHS.

Last thoughts for now: a large part of the discussion also seems motivated by what we do on the SPIR-V side; it would be good to have a good description of that, in particular given the misconceptions about the availability of multi-dimensional vectors in LLVM. I suspect what is happening is that LLVM does a good job at breaking down big 1-D vectors into smaller ones with bounded shapes, as well as percolating small vectors into larger ones. As a consequence, we did not have to implement a bunch of canonicalizations / foldings on 1-D vectors in MLIR. I suspect this was not the case on SPIR-V, but I was not involved in those discussions.


Sorry, that was a poorly formulated argument on my part.

You are right, the following form contains all the required information:

vector.expand_shape [[0, 1]] ... : vector<[4]xf32> to vector<1x[4]xf32>

However, some logic will have to decide which dimensions to make scalable in the expanded shape (i.e. [1] vs [4]). Or, put differently, the notion of scalability has to be correctly propagated all the way to LLVM IR. That becomes very relevant when vector operations are unrolled - we cannot unroll scalable dimensions.
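For example (an illustrative case, not from the thread): an op on vector<4x[4]xf32> can be unrolled along dim 0 into four ops on vector<1x[4]xf32>, but not along dim 1, whose size is only known at runtime. If the expansion logic marks the wrong dimension as scalable, the result may become impossible to unroll.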

The example above with two scalable dimensions worries me a bit - the only target that supports scalability in 2 dimensions that I am aware of is Arm’s SME (Scalable Matrix Extension). To lower to SME we intercept all relevant operations and use custom lowerings (through the ArmSME dialect). Should this be supported in more general cases?

I’m not against this proposal, just wanted to highlight that the new ops will present some new challenges in the context of scalable vectors (nothing unsolvable IMHO). However, going back to the original intent of this RFC:

Improving handling of unit dimensions in the vector dialect

Things would become much easier if we knew that one of the dimensions is “1” :slight_smile: Perhaps that’s what’s missing here?

-Andrzej

+1 on updating the doc to describe the current semantics. It would be a good starting point.

RE idea #1:

My understanding is that shape_cast has “data movement” semantics; we might want some “non-data movement” reshape ops at the vector level. Introducing vector.collapse_shape and vector.expand_shape sounds good for the “non-data movement” purpose and can help us towards canonicalized IR.

The question would be how we lower the new vector ops. A quick idea is folding them into memref ops, which aligns with the “no data movement” semantics, and lowering them to vector.shape_cast, with potential data movement penalties, when they are not foldable.

I’ve been working on pack/unpack codegen, and I think it would help if we could generate vector.shape_cast/vector.collapse_shape/vector.expand_shape during vectorization. Clarifying the semantics of vector reshape operations would be helpful in this case.

This is what I call “no data movement”.
At the level of the vector dialect, we shouldn’t be concerned with the fact that shuffling may be needed on some HW: I see this as a “virtual ISA”, and within this virtual ISA there is no data movement associated with %1 = vector.shape_cast %0 : vector<1x13xf32> to vector<13x1xf32>.

This is pretty consistent with other aspects; even in LLVM, some HW may not have support for all possible vector operations. Imagine that you don’t support “max” between two vectors: the lowering would have to extract every single element into a scalar register and reassemble the vector. This is the kind of implementation detail that doesn’t percolate up to the description of the “virtual instruction set” above.

Agreed, but I have also come to understand that others may expect “no movement” to become a no-op, which is a whole other discussion.

The physical medium where the data is materialized matters: in memory this can turn into a simple reindexing; in actual 1-D vectors this results in shuffles.

It depends on your HW vector size?
Actually, I’d think that with 2D HW vectors you are even more likely to have to shuffle. If your HW supports 16x16 registers with masking, then going from 13x1 to 1x13 is actually really a transpose!
Your point (with a minor twist on 1-D → n-D) about “in memory this can turn into a simple reindexing; in actual n-D vectors this results in shuffles” may be spot on here. Even in the “virtual ISA” that is the vector dialect, shape_cast may not make sense at all!
(That is, unless we decouple virtual indexing from the layout, of course: RFC: Representing register data layout explicitly in the IR.)

Please let us know if anybody plans to bring this to an ODM. In the meantime, I would like to leave some comments here before I forget (we briefly discussed this topic in the Mai-Tai meeting on Tuesday):

  • Regardless of what we do with vector.shape_cast, I think we should move forward with adding vector.collapse_shape and vector.expand_shape. That would move part of the existing semantics in vector.shape_cast into two more specialized and better defined ops and should simplify the analysis and lowering of vector.shape_cast, and the scope of the problem under discussion.

  • We have been operating under the assumption that vector.shape_cast is a no-op or implies “no data movement”, but is that really the case? For example, a 1-D → 2-D cast could imply a copy between a 1-D vector register and a 2-D vector register, a 2-D → 2-D cast a reconfiguration of the register file, etc., etc. There is also nothing that prevents a pass from rearranging data into multiple vector registers or moving data to scalar registers if the destination shape of a vector.shape_cast is not “friendly” to that target… Consequently, defining an op with “no data movement” or “data movement” properties (at least in the way we are using those terms here) might not be a good idea. Whether data needs to be moved or not seems more a property of the lowering of that op, and something that is target dependent.

  • From an abstraction point of view, in addition to vector.transpose, we would need ops to expand a vector shape, collapse a vector shape, and reshape a vector shape using static shapes where the source and the destination shapes have the same rank (i.e., new semantics for vector.shape_cast?).

  • It looks like there is a lowering of vector.shape_cast to 1-D LLVM vectors that could be reused for SPIR-V.


Thanks @qed for putting up the proposal to make the discussion more concrete! The unit dimension has been a consistent pain point; it would be super nice to see it under careful scrutiny.

A few comments from me based on my experience with vector dialect and transformations in general.

Overall, for the vector dialect, I think figuring out the abstraction level and semantics of ops is important; how they compose with each other under different transformation patterns is also a key discussion point: as the vector dialect is organized today, we have various small atomic-like patterns that compose organically. A seemingly small change may trigger something vastly unexpected.

vector.shape_cast positioning and semantics

I personally find it hard to grasp the exact semantics of vector.shape_cast.

My understanding of the vector dialect is that it actually has two modes—one virtual vector mode (where we don’t have the exact mapping to hardware registers), and another more hardware vector mode (where a vector maps to a hardware vector register). We go from the virtual mode to the hardware mode with intra-dialect lowerings.

It’s a bit fuzzy where the exact cut is, but generally I feel n-D vectors (what we get directly after vectorization in MLIR) are the former, and small 1-D vectors are in the latter category. And we can roughly put different ops into different buckets; for example, I’d say vector.transfer_read/write are more for the former while vector.load/store are for the latter. Here is my mental model of how the other ops fall into these buckets. I think it sort of agrees with what the vector dialect doc says.

Then the problem for vector.shape_cast is that it actually crosses these two categories, with its ability to turn m-D to n-D directly. And it does so in a one-step, unstructured manner, which is sort of in conflict with how we want progressive lowering to work within the vector dialect in general.

vector dialect pattern composition

To go from the high-D virtual vector to the low-D hardware vector, typically we want a few steps: unrolling, dropping leading unit dims, hoisting, breaking down various extract/insert ops, propagating extract/insert to cancel them, forwarding vector load/store, and so on.

My experience is that it’s very tricky to organize these patterns in a coherent way. The previous link uses a convolution with some fused elementwise op as an example. In reality we can see even more complicated examples, e.g. with mixed types: fused i4 dequantization + f16 matvec + f16 elementwise (for LLM decoders, for example). The various vector ops/patterns must really balance nicely to work out the final clean register-level IR we want.

Let’s look at shape_cast and trace it through the chain of conversions.

unrolling

The first big one is unrolling. I don’t quite get how we would unroll a general vector.shape_cast, and unrolling is just a big piece of the transformation story if vector.shape_cast is meant to exist in the virtual vector mode. If we cannot unroll it, it would become a hard blocker there: all the following patterns that cancel out and clean up the vector ops down to 1-D cases will fall apart. Do we just directly lower shape_cast there? Then are we just doing preliminary lowering / scalarizing part of the IR, which blocks propagating/cancelling the other transpose/insert/extract/etc. ops?
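To make the difficulty concrete (a small hypothetical case): for %1 = vector.shape_cast %0 : vector<2x3xf32> to vector<3x2xf32>, row 1 of the result holds linearized elements 2 and 3, i.e. %0[0, 2] and %0[1, 0]. It straddles rows of the source, so there is no single vector.extract of %0 that yields that slice of %1; any unrolling has to go through finer-grained extracts, effectively scalarizing.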

propagation and cancellation

One thing that is nice about the other existing vector ops is that we can more easily compose them and propagate and cancel them out. It’s key to generating clean IR at the MLIR level. So, for example, you see all the folding logic in VectorOps.cpp, e.g., this one for extract/insert/transpose, this one for extract/insert/broadcast, and many others in the file. These are generally simple given they are all sort of structured.

The same level of pattern support for vector.shape_cast is not there, I would say, at least. And given the unstructuredness, we’d need more analysis to see through how different dimensions relate in order to support it.

the proposal

Sorry for the wall of text above, but I hope it is some food for thought and provides some context. Coming back to the proposal, I agree with the general direction of peeling more functionality out of vector.shape_cast and making dedicated ops for it. So collapse/expand pairs sound good to me. I’d think they are easier to handle than the general shape_cast w.r.t. the transformations listed above, given they are more structured.

Regarding shape_cast itself, I feel it might be better to restrict it to a lower-level position in the vector dialect. But I don’t know how to best frame its semantics.

regarding LLVM vs SPIR-V

Quite some discussion has been happening around LLVM vs SPIR-V, given that the change was motivated by making LLVM scalable vector cases work but breaks SPIR-V flows.

I’d actually think that’s just smoke. I agree with @nicolasvasilache above that, due to LLVM doing some magic under the hood, we are fine without some transformations in MLIR when targeting LLVM. But for targeting SPIR-V we need to organize various patterns nicely to generate the exact clean IR we want in the end at the MLIR level. The missing pieces are still at the vector level. (Or maybe it’s just that I didn’t find the proper piece to call into.)

I’d be very happy to fix whatever is missing and use the agreed-upon approach, once we have a clear understanding of how to handle vector.shape_cast in general.


I don’t quite understand what you mean by “crosses these two categories”, but for:

I’m wondering if this op makes sense at all in the vector dialect.

Something that clicked for me was the connection with RFC: Representing register data layout explicitly in the IR: Triton uses tensors to model layouts; the vector type does not have a layout and instead assumes a notion of a 1:1 fixed layout between the vector shape and the HW (as far as I can understand).
If this mental model is correct, shape_cast is necessarily a shuffle (or a 1-D linearization and n-D reformation in the worst case): it’s not possible for an operation to return a vector of a different shape without data movement (assumption: a vector maps to registers with a fixed shape/layout).

I wonder if we could do away with this operation entirely? collapse/expand may cover a large portion of the use cases, but we should review why/how shape_cast gets introduced and what the alternatives are. Maybe instead of shape_cast we could have a pair of linearize/delinearize?
(In this view, “transpose” is indeed more structured than “reshape”, by the way, but “reshape” is a misleading name.)
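A hedged sketch of the linearize/delinearize alternative (op names and syntax hypothetical):

%flat = vector.linearize %v : vector<2x3xf32> to vector<6xf32>
%res = vector.delinearize %flat : vector<6xf32> to vector<3x2xf32>

Every rank-changing cast would be forced through an explicit 1-D pivot, making the element-order contract syntactically obvious.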

I’ve got the same intuition but have also not thoroughly evaluated it. However, my experience with this kind of situation, as Nicolas pointed out upthread, makes me sympathetic to the idea that this is a manifestation of a shortcut to LLVM that has persisted from the early days. There have been a few such design flaws uncovered so far, primarily surfaced by the added diligence that representing things in SPIR-V requires. Given this is the fourth or fifth time I’ve dealt with this class of issue, I tend to listen when the SPIR-V folks say there is a representation problem. It often points to missing infra and a fair amount of getting lucky in the current lowerings.

Now, that’s not enough to actually fix the situation, but we’ve done well in the past by giving credence to this perspective.

Thanks for the detailed writeup Lei and Quinn. +1 to a higher bandwidth conversation.


I’m thinking it might be worth reconsidering squeeze/unsqueeze then. Even collapse/expand or linearize/delinearize still lowers to tricky shuffles for higher-dimensional HW registers (which, to me, could be another way to think about SIMD code before layout + distribution on GPU). Take a 2D HW register of 4x4 elements and try collapsing a vector<2x3> to vector<6>. This does a shuffle that separates previously “contiguous” elements.

in 1 "register":
[[0, 1, 2, _],
 [3, 4, 5, _],
 [_, _, _, _]
 [_, _, _, _]]

becomes

in 2 "registers":
[[0, 1, 2, 3],
 [...]]
+
[[4, 5, _, _],
 [...]]

Not to say collapse/expand isn’t still useful, but we have to be very careful about how the semantics are described. Conversely, squeeze/unsqueeze has much nicer properties. For example, with the same physical 4x4 registers:

# Transpose a column to a row of 4 elements in register
vector.squeeze ... : vector<4x1> to vector<4>
# Transpose a row to a column of 4 elements in register
vector.unsqueeze ... : vector<4> to vector<4x1>
# No-op
vector.unsqueeze ... : vector<4> to vector<1x4>
# No-op
vector.squeeze ... : vector<1x1x4x4> to vector<4x4>
# No-op
vector.unsqueeze ... : vector<4x4> to vector<1x1x4x4>
# Shuffle 4 `1x4` vectors into a single `4x4` vector
vector.squeeze ... : vector<4x1x4> to vector<4x4>
# "Unroll" a single `4x4` vector into 4 `1x4` vectors
vector.unsqueeze ... : vector<4x4> to vector<4x1x4>

Essentially, my thinking is that all squeeze/unsqueeze ops are either no-ops, transposes, or highly structured shuffles/“unrolling” that are much easier to implement/reason about than the general case of collapse/expand. Even taking the 13x1 case from earlier

# Go from 3 `4x1` + 1 `1x1` vectors to 3 `1x4` + 1 `1x1` vectors
vector.squeeze ... : vector<13x1> to vector<13>

In other words, it’s still 4 independent transposes. In general, I’d expect that which of these categories a given squeeze/unsqueeze falls into depends only on the dimensionality of the hardware registers, not their actual sizes.

A higher bandwidth discussion sounds like a good idea.
