[RFC] Adding `vector.to_elements` op to the Vector dialect

Hi all,

This is a proposal to add a new operation, vector.to_elements, to the Vector dialect.
This operation is the symmetric counterpart to the existing vector.from_elements op.

An initial prototype can be found in this PR: [mlir][Vector] Add vector.to_elements op

Proposal

The vector.to_elements operation leverages MLIR’s multi-result op feature to decompose
an input vector into all its scalar elements. The decomposed scalar elements are
returned in row-major order. For example:

// Decompose a 1-D vector of 2 elements.
%0:2 = vector.to_elements %v1 : vector<2xf32>
// %0#0 = %v1[0]
// %0#1 = %v1[1]

// Decompose a 2-D vector.
%0:6 = vector.to_elements %v2 : vector<2x3xf32>
// returns 6 scalar values in flattened row-major order.
// %0#0 = %v2[0, 0]
// ...
// %0#5 = %v2[1, 2]
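The row-major ordering above can be sketched in plain Python (a hypothetical illustration of the semantics, not part of the proposal): result #k of `vector.to_elements` corresponds to the k-th multi-dimensional index in lexicographic (row-major) order.

```python
import itertools

def row_major_indices(shape):
    """Yield multi-dimensional indices in the order in which
    vector.to_elements returns its scalar results (row-major)."""
    return list(itertools.product(*(range(d) for d in shape)))

# For vector<2x3xf32>: result #0 is [0, 0], ..., result #5 is [1, 2].
print(row_major_indices((2, 3)))
# → [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```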

The operation inherently encodes the extraction order of the scalar elements without
explicit indices, providing a powerful abstraction to encode complex vector-wide
extraction and insertion transformations.

Motivation

vector.to_elements brings two major benefits when processing large sequences of
vector.extract (and vector.insert) operations:

1. Code Size Reduction

Currently, extracting all or many elements from a vector requires large sequences of
vector.extract operations. Consider a vector<1024xf32>:

  • Today: 1,024 separate vector.extract operations
  • With vector.to_elements: 1 operation

Using vector.to_elements represents a major reduction in code size for these scenarios.

2. Simplified Pattern Recognition and Optimization

vector.to_elements simplifies vector transformation analysis involving large sequences
of vector extraction and insertion operations. For instance, LLVM’s InstCombine spends
significant time analyzing chains of extractelement/insertelement instructions to
identify shuffle patterns. This requires:

  • Grouping large sequences of extractelement instructions by source vector
  • Matching large sequences of extractelement instructions to large sequences of insertelement instructions
  • Analyzing and sorting extraction and insertion indices

The structured and implicit order of extraction in vector.to_elements eliminates
this complexity. For example:

  • Folding redundant sequences of extractions and insertions can be as simple as: llvm::equal(toElements.getResults(), fromElements.getOperands())
  • Transforming large sequences of extractions and insertions into a shuffle (simple case) can be as simple as checking that all vector.from_elements operands originate from the same vector.to_elements.
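As an illustration of the first bullet (a minimal Python sketch that models SSA values as list elements; the helper name is hypothetical), the fold reduces to a single ordered equality check:

```python
# Hypothetical model: vector.to_elements produces a list of scalar results and
# vector.from_elements consumes a list of scalar operands. The pair folds to
# the original vector iff the operand list equals the result list, in order.
def folds_to_identity(to_elements_results, from_elements_operands):
    # Mirrors llvm::equal(toElements.getResults(), fromElements.getOperands()).
    return list(to_elements_results) == list(from_elements_operands)

elems = ["e0", "e1", "e2"]  # results of vector.to_elements %v

# from_elements(%e0, %e1, %e2) rebuilds %v, so the pair folds away.
assert folds_to_identity(elems, ["e0", "e1", "e2"])
# A permutation of the elements is a shuffle, not an identity fold.
assert not folds_to_identity(elems, ["e2", "e1", "e0"])
```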

Discussion

Relationship with vector.extract

vector.to_elements and vector.extract have overlapping functionality. A long-term goal could be to extend vector.to_elements to cover all use cases of vector.extract, which would allow us to propose the deprecation of vector.extract in favor of its more structured and powerful counterpart. However, we believe it is best to first introduce this operation and gain practical experience before pursuing that path.

Handling Unused Results

vector.to_elements may generate a large number of result values that have no uses. In terms of memory consumption, we do not think this is an issue compared to vector.extract operations, especially when a considerable number of elements are being extracted. Identifying and ignoring dead results during the transformation or lowering of vector.to_elements is trivial, as it only requires checking whether the corresponding result value has any uses.

Next Steps

We seek support to add vector.to_elements to the Vector dialect. The immediate plan is to implement more advanced vector.extract/vector.insert canonicalization/transformation patterns that this new operation will enable, which are not present in MLIR today. After validating the approach and gaining more experience, we plan to continue working towards the deprecation of vector.extract in favor of vector.to_elements, if the community thinks this is the right way to go.

Your feedback is appreciated!
Thanks!


Thanks Diego, I am in favour of this proposal!

I guess we should really be discussing vector.from_elements + vector.to_elements together :slight_smile: In fact, your rationale + future steps nicely apply to both.

+1 to reducing the number of Ops, but we should also refrain from creating “swiss knife” Ops. That would be my only concern.

-Andrzej

Yes, let me elaborate on the example provided in “2. Simplified Pattern Recognition and Optimization”. Suppose we want to transform the following pattern:

%e0 = vector.extract %v0[0] : f32 from vector<512xf32> 
%e1 = vector.extract %v0[1] : f32 from vector<512xf32> 
... 
%e511 = vector.extract %v0[511] : f32 from vector<512xf32> 
 
%i0 = vector.insert %e5, %v1[0] : f32 to vector<512xf32> 
%i1 = vector.insert %e13, %i0[1] : f32 to vector<512xf32> 
... 
%res = vector.insert %e0, %i510[511] : f32 to vector<512xf32> 

into:

%res: vector<512xf32> = vector.shuffle %v0, %v0 [...] : vector<512xf32>, vector<512xf32> 

To achieve this, we need to:

  1. Ensure that all the vector.insert operations contribute to the same resulting vector (%res).
  2. Verify that all the inserted elements originate from the same source (%v0). We can also implement cases with two vector sources if we aim to generate just one shuffle. More advanced patterns generating vector.shuffle sequences are also possible but note how the complexity increases with the vector.extract/vector.insert representation.
  3. Compute the shuffle mask by combining the extraction pattern with the insertion pattern.

Now, consider the vector.to_elements/vector.from_elements scenario to implement the same transformation:

%e:512 = vector.to_elements %v0 : vector<512xf32> 
%res = vector.from_elements %e#5, %e#13, ..., %e#0 : vector<512xf32> 

With this representation:

  1. It’s trivial to infer that all the insertions contribute to the same resulting vector (%res) by definition of vector.from_elements.
  2. It’s simpler to infer that all the extractions originate from the same source (%v0), by checking that all the vector.from_elements operands have the same defining op and by vector.to_elements definition.
  3. Computing the shuffle mask is also simpler since the vector.to_elements returns elements in a predefined order. The shuffle mask corresponds to the result index of each operand in vector.from_elements.
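The steps above can be sketched with a small Python helper (hypothetical names; SSA values are modeled as strings, whereas in MLIR each operand is an OpResult whose getResultNumber() gives the index directly):

```python
def compute_shuffle_mask(to_elements_results, from_elements_operands):
    """Mask[i] is the position of from_elements operand i among the
    to_elements results. Because to_elements returns elements in a
    predefined row-major order, that position is the source lane."""
    index_of = {val: k for k, val in enumerate(to_elements_results)}
    return [index_of[op] for op in from_elements_operands]

# %e:512 = vector.to_elements %v0, then
# %res = vector.from_elements %e#5, %e#13, %e#0 yields mask [5, 13, 0].
elems = [f"e{k}" for k in range(512)]
print(compute_shuffle_mask(elems, ["e5", "e13", "e0"]))
# → [5, 13, 0]
```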

Hopefully that helps.

I think the feedback in the PR has been positive so let me know if there are any other concerns. Otherwise, let’s move forward!

Also, cc: @kuhar, @Groverkss, @newling

I wouldn’t qualify these ops as “swiss knife” ops. They are doing only one well-defined and very specific thing, without any modifiers. They just take advantage of MLIR multi-result op features.


Thanks for writing up the RFC, Diego. I’m +1 overall. Having a counterpart to from_elements makes sense to me. In the IREE amdgpu pipeline, we end up emitting a lot of code that tries to essentially transpose or swizzle 2d vectors which ends up in long chains of inserts/shuffles/extracts. I think that lowering through to_elements + from_elements would be much more concise and readable (2 ops vs. dozens).

I don’t think we could (or would want to) express dynamic extracts/inserts through to_elements/from_elements. Similarly, maybe it makes sense to restrict extract to be rank-reducing and disallow extracting scalars. I don’t think we need to find an answer (or agree on one) in this RFC to make progress on to_elements though.

In terms of memory consumption, we do not think this is an issue compared to vector.extract operations, especially when a considerable number of elements are being extracted

How are results allocated? Do we need to have one vector element per result? Even if it’s fine compared to long chains of extracts, I wonder if this could become a future issue for new code that operates on large vectors and attempts to extract one element only.

I also thought about the SPIR-V lowering story – I think the key thing we need would be vector unrolling patterns for to/from_elements. Otherwise I don’t foresee issues with the lowering after we break it down into vectors of supported sizes, and the shuffle idiom pattern recognition should be similar to the one for LLVM IR.

The from_elements + to_elements pair seems very natural to me, I’m in favour of uniting these complementary ops! ( :white_check_mark: )

I would like to dig deeper into the relationship of the extract-like ops, and lowerings to LLVM. Maybe we should discuss this in a different thread, so I can move it elsewhere if that is preferred.

We (will) have the following core ops that can do different variants of extract:

1) extract

Maps almost 1:1 with LLVM’s extractvalue. Lowering to LLVM here. The lowering fails if the extract position is dynamic. I assume that the user is expected to have performed unrolling of surrounding loops (etc) to ensure all extracts are static before lowering to LLVM.

2) extract_element (soon to be removed)

Maps 1:1 with LLVM’s extractelement. Lowering to LLVM here.

3) shuffle

Maps almost 1:1 with LLVM’s shufflevector, although the Vector dialect version supports high-rank vectors.

4) extract_strided_slice

This doesn’t map to a single LLVM operation, and has lowering patterns applied to it that convert it to vector.extract and vector.shuffle ops during LLVM conversion (see here). This op is neither a super-set of extract because it does not support dynamic position, nor is it a subset, because it slices dimensions (extract just takes a suffix shape). It is also incomplete, because striding was never implemented.

5) to_elements (proposed)

The new op. How will vector.to_elements lower to LLVM? I imagine it must ultimately lower to a sequence of llvm.extractelement ops? Not sure how that would fit into deprecating vector.extract.

So, lots of ways to extract vectors/scalars from vectors! It would definitely be interesting to explore reducing this set, and in my mind merging extract_strided_slice and extract into a single op isn’t unthinkable. My main concern is that these ops are used extensively downstream, and so removing/modifying them will be challenging.

Thanks a lot for the feedback!

Good point! Replacing vector.extract was not in my original plans and I missed the dynamic case when discussing this point in the PR. You are right, supporting dynamic extraction in its most generic form is likely not feasible with vector.to_elements so vector.extract would need to stay around for these cases. I think we need to build expertise around this to better understand the role we want vector.extract to play in the end. Also agree that this is not something we have to decide right now.

Each extracted element would need an OpResult (Value) instance, all of them with the same type. It’s not too bad. This was also my concern initially but I realized that the “large vector with only a few extracts” scenario is also relatively rare. Manipulating large vectors typically entails transforming the whole vector in one way or another. Worst case scenario, we can leverage vector.extract for these cases as well.

On the MLIR side, we have support for everything that can be lowered to LLVM. I’m not sure if we can do more than that. To clarify, the lowering to LLVM of dynamic extractions is supported for 1-D vectors or n-D vectors with trailing static indices (some tests). However, n-D vectors with trailing dynamic indices are not supported because, when n-D vectors reach LLVM, they are lowered to LLVM array types, which don’t allow dynamic extraction.

Interestingly enough, this proposal brought vector.extractelement back to mind, and I sent a few PRs ([1], [2] and [3]) over the weekend to make progress on its deprecation. vector.extractelement should probably be gone before EOM.

This simplification could make sense! It ultimately depends on the use cases we end up needing vector.extract for. The vector.extract_strided_slice is a different beast but it might make sense to delegate vector.extract extractions to it if we don’t have many use cases. The fact that support for strides > 1 hasn’t been implemented over time may also indicate that we don’t need that functionality. Again, I think we need to build some experience to make a more informed decision here.


Thank you all for the feedback. If there are no more questions or concerns, could I get an approval on [mlir][Vector] Add vector.to_elements op? Thanks!