I’m trying to work out how to interface between fixed-length and scalable contexts within MLIR. I believe there are a few of us working on this, and now is probably the best time to tackle it. First, let me start with a description of the problem and its motivation.
Quick refresh on scalable vectors
(Skip if you’re already familiar with scalable vectors)
A scalable vector holds a number of elements that’s a multiple of a base size, and that multiple is a runtime constant: the vector scale (a VPU design parameter).
For instance, if a vector<4xf32> can hold 4 floating point elements, a scalable vector<4xf32> can hold 4, 8, 12, … up to a limit defined by the ISA. In MLIR, we represent such a vector as vector<[4]xf32>, meaning that the dimension of 4 elements can have a multiplicity, or vector scale, greater than 1.
Scalable vectors introduce a new software-side concept: vector-length agnosticism. An operation is vector-length agnostic when it works for any possible vector scale, and likewise, code is vector-length agnostic when it works for any possible vector scale.
Conceptually, if a fixed-length vector addition loop is something like:
for (unsigned i = 0; i < num_data_elements; i += 4) {
  v4f32 a = load_vector(data, i);   // Load <4 x f32>
  v4f32 b = a + a;                  // Element-wise add of <4 x f32> to <4 x f32>
  store_vector(result, i, b);       // Store <4 x f32>
}
A vector-length agnostic equivalent is something like:
for (unsigned i = 0; i < num_data_elements; i += 4 * vector_scale) {
  sv4f32 a = load_scalable_vector(data, i);   // Load <vscale x 4 x f32>
  sv4f32 b = a + a;                           // Element-wise add of <vscale x 4 x f32> vectors
  store_scalable_vector(result, i, b);        // Store <vscale x 4 x f32>
}
Where vector_scale is the runtime constant defining the scale of our vectors. When the operations are more complex than a simple element-wise vector addition (think of horizontal reductions), a useful conceptual model is to understand the vector-length agnostic operation as an implicit loop over contiguous vectors of the base size.
For instance, the following is a common vector-length agnostic way to perform a 4-wise dot product:
for (unsigned i = 0; i < num_data_elements; i += 4 * vector_scale) {
  sv4f32 a = load_scalable_vector(a_data, i);   // Load <vscale x 4 x f32>
  sv4f32 b = load_scalable_vector(b_data, i);   // Load <vscale x 4 x f32>
  sv4f32 c = vla_scalable_dot_product(a, b);    // Perform vscale x <4 x f32> by <4 x f32> dot products
  store_scalable_vector(c_data, i, c);          // Store <vscale x 4 x f32>
}
The reduction doesn’t happen across the whole length of the physical vector, but across segments of the base vector length (4 x f32), as if the code were:
for (unsigned i = 0; i < num_data_elements; i += 4 * vector_scale) {
  for (unsigned j = i; j < i + 4 * vector_scale; j += 4) {
    v4f32 a = load_vector(a_data, j);   // Load <4 x f32>
    v4f32 b = load_vector(b_data, j);   // Load <4 x f32>
    v4f32 c;
    c[0] = dot_product(a, b);           // Perform one <4 x f32> by <4 x f32> dot product
    store_vector(c_data, j, c);         // Store <4 x f32>
  }
}
The opposite of a vector-length agnostic (VLA) operation or code is a vector-length specific (VLS) operation or code. Notice that not all operations on scalable vectors are vector-length agnostic. For instance, shuffle or extract ops that operate over the total length of the vector (not a base segment) would not be VLA, even if they are performed on scalable vectors. Likewise, a non-segmented horizontal reduction on a scalable vector (e.g.: an operation that computes the addition of all the values in a scalable vector and returns a single scalar) is not VLA either.
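To make the distinction concrete, here is a small LLVM IR sketch of my own (not part of the proposal), contrasting the two kinds of operations using the upstream intrinsic names for SVE’s segmented dot product and the generic whole-vector reduction:

; VLA in the sense above: each 32-bit lane of the accumulator receives the dot
; product of the corresponding 4-element i8 segment, whatever vscale is.
declare <vscale x 4 x i32> @llvm.aarch64.sve.sdot.nxv4i32(
    <vscale x 4 x i32>, <vscale x 16 x i8>, <vscale x 16 x i8>)

; Not VLA: folds every element of the physical register into a single scalar,
; so the set of values it combines depends on the runtime vscale.
declare i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32>)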
Vector-length specific code on scalable architectures
Why?
For performance reasons, even if our vector architecture is scalable, we may want to generate VLS code. We can assume a specific vector scale, and generate code with a known target vector length. In principle, as long as we make sure that the host architecture that will run the code supports the target vector length, we can generate correct VLS code that runs on a scalable architecture.
There are a couple of different ways to go about this:
- Generate fixed-length vectors of the appropriate size (the assumed vscale times the base length), for instance vector<16xf32> for a scalable architecture of 512 bits and a base length of 128 bits (a vscale of 4), and use function attributes that force the instruction selector to pick scalable instructions (if there is a fixed-length alternative); see the sketch after this list.
- Generate scalable vectors (e.g.: vector<[4]xf32> in the example above), but assume a fixed size in your loop steps.
Option two comes without any interfacing issues, but forces you to use generic VLA vector instructions in the IR and prevents VLS code generation strategies. From a performance point of view, the first option is the most interesting one, but it comes with some interfacing issues.
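For concreteness, a minimal LLVM IR sketch of option one could look like the following (the function @add_vls and its body are hypothetical; vscale_range is the existing LLVM function attribute):

; Fixed-length IR plus a vscale_range attribute pinning the runtime vector
; scale to 4, i.e. a 512-bit implementation with a 128-bit base vector length.
define <16 x float> @add_vls(<16 x float> %a, <16 x float> %b) #0 {
  ; Fixed-length op; with the attribute below, the backend may still select
  ; scalable instructions for it.
  %sum = fadd <16 x float> %a, %b
  ret <16 x float> %sum
}

attributes #0 = { vscale_range(4,4) }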
The problem
For basic arithmetic instructions, using scalable instructions or fixed-length instructions is entirely up to the instruction selector, and function attributes like vscale_range can force the selection of scalable instructions.
For complex, architecture-specific operations (dot products, inner product, outer products, …) we need to generate intrinsics that are only defined for scalable operands. If our code has been generated with fixed-length vectors but we want to rewrite a higher level vector instruction with a hw-specific scalable vector intrinsic, we need a way to cast the incoming fixed-length operands into equivalent scalable vectors, and back from scalable to fixed-length vectors for the result.
For instance:
#gemm_trait = {
indexing_maps = [
affine_map<(i, j, k) -> (i, k)>,
affine_map<(i, j, k) -> (k, j)>,
affine_map<(i, j, k) -> (i, j)>
],
iterator_types = ["parallel", "parallel", "reduction"]
}
func.func @gemm(%a: vector<2x8xi8>, %b: vector<8x2xi8>, %acc: vector<2x2xi32>)
    -> vector<2x2xi32> {
  %0 = vector.contract #gemm_trait %a, %b, %acc : vector<2x8xi8>, vector<8x2xi8> into vector<2x2xi32>
  return %0 : vector<2x2xi32>
}
Since the semantics match, the vector contraction can be rewritten as the arm_sve.smmla instruction from the ArmSVE dialect, but while the fixed-length vector.contract takes fixed-length vectors, arm_sve.smmla takes scalable vectors. We need a mechanism to perform that cast.
Solutions in LLVM IR
In LLVM IR, there are a couple of experimental intrinsics, llvm.experimental.vector.insert and llvm.experimental.vector.extract, that allow the insertion/extraction of fixed-length vectors into/from scalable vectors. This way, you can pack fixed-length vectors into scalable vectors, call a scalable vector function or intrinsic, and unpack the result back to fixed-length vectors. Like so:
define void @fl2svmuladd(float* %arg0, float* %arg1, float* %arg2) {
  ; Fixed-length world
  %0 = bitcast float* %arg0 to <8 x float>*
  %1 = bitcast float* %arg1 to <8 x float>*
  %2 = bitcast float* %arg2 to <8 x float>*
  %3 = load <8 x float>, <8 x float>* %0
  %4 = load <8 x float>, <8 x float>* %1
  %5 = load <8 x float>, <8 x float>* %2
  ; Fixed-length to scalable
  %6 = call <vscale x 4 x float> @llvm.experimental.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> %3, i64 0)
  %7 = call <vscale x 4 x float> @llvm.experimental.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> %4, i64 0)
  %8 = call <vscale x 4 x float> @llvm.experimental.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> %5, i64 0)
  ; Scalable world
  %9 = call <vscale x 4 x float> @llvm.fmuladd.nxv4f32(<vscale x 4 x float> %6, <vscale x 4 x float> %7, <vscale x 4 x float> %8)
  ; Scalable to fixed-length
  %10 = call <8 x float> @llvm.experimental.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> %9, i64 0)
  ; Back in fixed-length world
  store <8 x float> %10, <8 x float>* %2
  ret void
}
Interfacing between fixed-length and scalable vectors in MLIR
The question I’d like to ask is, what’s the best way to address this issue of mixed fixed-length and scalable vectors within MLIR?
My assumptions:
- We want to do this only when we’re interfacing fixed-length vector code in the Vector dialect with one of the intrinsics in the scalable hw-specific dialects (ArmSVE or RVV for now).
  - This will be a common occurrence when we use fixed-length vectorization strategies but want to target a complex non-SIMD vector operation (like gemm acceleration ops) on a scalable architecture.
- For these complex vector operations, the fixed-length to scalable conversion will often be accompanied by a shape conversion.
  - Since LLVM IR only admits rank-1 vectors but these operations often have multi-rank semantics (like sdot, which operates on base segments, or smmla, which operates on tiled data), the switch from fixed-length to scalable will be preceded by a flattening of the vector, and the switch from scalable to fixed-length will be followed by a reshape from a linear to a multi-rank vector.
Based on these assumptions, although the obvious answer to this question might be to extend vector.insert and vector.extract to accept mixed scalability, and lower those cases to the LLVM IR intrinsics, I believe the operation that makes the most sense for this process is vector.shape_cast.
From a high-level point of view, even if we implement it with these insert/extract constructs, going from fixed-length to scalable is more of a “shape cast” type of operation. Since I anticipate the need for a shape cast anyway, I think that is the operation we should modify for this.
If we go back to the arm_sve.smmla example above, the resulting conversion would be:
func.func @gemm(%a: vector<2x8xi8>, %b: vector<8x2xi8>, %acc: vector<2x2xi32>)
    -> vector<2x2xi32> {
  %sa = vector.shape_cast %a : vector<2x8xi8> to vector<[16]xi8>
  %sb = vector.shape_cast %b : vector<8x2xi8> to vector<[16]xi8>
  %sc = vector.shape_cast %acc : vector<2x2xi32> to vector<[4]xi32>
  %0 = arm_sve.smmla %sc, %sa, %sb : vector<[16]xi8> to vector<[4]xi32>
  %res = vector.shape_cast %0 : vector<[4]xi32> to vector<2x2xi32>
  return %res : vector<2x2xi32>
}
Lowering shape cast between scalable and fixed-length vectors
The follow-up question is, how do we lower these mixed scalability shape casts?
The process of casting a scalable vector to a fixed-length vector consists in defining the vscale constant at compile time. That is, going from something like vector<[4]xf32> to vector<8xf32>, where we are forcing our scalable architecture to be a 256b vector architecture (vscale = 2). Likewise, if we have a vector<8xf32>, a 256b vector, and we want to map it to a scalable architecture with 128b of base vector length, we can trivially do so by dividing the length by the vscale: vector<[4]xf32>.
My proposal is that we can lower these “trivial” shape casts, in which the fixed-length size is a multiple of the base size of the scalable vector, to experimental.vector.insert/extract in the conversion from the Vector dialect to the LLVM dialect. E.g.:
%sv = vector.shape_cast %in : vector<8xf32> to vector<[4]xf32>
%flv = vector.shape_cast %sv : vector<[4]xf32> to vector<8xf32>
Can be trivially lowered to:
%loc = arith.constant dense<0.0> : vector<[4]xf32>
%slv = llvm.intr.vector.insert %in, %loc[0] : vector<8xf32> into vector<[4]xf32>
%flv = llvm.intr.vector.extract %slv[0] : vector<8xf32> from vector<[4]xf32>
For the non-trivial case, I propose to decompose the lowering into the flattening shape cast + a trivial fixed-length/scalable shape change (and vice versa).
For instance, going back to the arm_sve.smmla example, the first lowering step would be:
func.func @gemm(%a: vector<2x8xi8>, %b: vector<8x2xi8>, %acc: vector<2x2xi32>)
    -> vector<2x2xi32> {
  %0 = vector.shape_cast %a : vector<2x8xi8> to vector<16xi8>
  %sa = vector.shape_cast %0 : vector<16xi8> to vector<[16]xi8>
  %1 = vector.shape_cast %b : vector<8x2xi8> to vector<16xi8>
  %sb = vector.shape_cast %1 : vector<16xi8> to vector<[16]xi8>
  %2 = vector.shape_cast %acc : vector<2x2xi32> to vector<4xi32>
  %sc = vector.shape_cast %2 : vector<4xi32> to vector<[4]xi32>
  %3 = arm_sve.smmla %sc, %sa, %sb : vector<[16]xi8> to vector<[4]xi32>
  %4 = vector.shape_cast %3 : vector<[4]xi32> to vector<4xi32>
  %5 = vector.shape_cast %4 : vector<4xi32> to vector<2x2xi32>
  return %5 : vector<2x2xi32>
}
From there, the fixed-length to fixed-length vector.shape_cast operations are lowered as usual, and the trivial fixed-length to scalable casts (and vice versa) are lowered to llvm.intr.vector.insert/extract.
For a slightly more complex example, if we take one of the operands of arm_sve.smmla for the vscale = 4 case, that is, a 512b scalable architecture with a base vector size of 128b, the lowering process of one of the operands would be:
Initial:
%sv = vector.shape_cast %a : vector<4x2x8xi8> to vector<[16]xi8>
First lowering step (vector.shape_cast → vector.shape_cast):
%tv = vector.shape_cast %a : vector<4x2x8xi8> to vector<64xi8>
%sv = vector.shape_cast %tv : vector<64xi8> to vector<[16]xi8>
Second lowering step (vector.shape_cast → vector.insert/extract):
%cst = arith.constant dense<0> : vector<64xi8>
%0 = vector.extract %a[0, 0, 0] : vector<4x2x8xi8>
%1 = vector.insert %0, %cst [0] : i8 into vector<64xi8>
%2 = vector.extract %a[0, 0, 1] : vector<4x2x8xi8>
%3 = vector.insert %2, %1 [1] : i8 into vector<64xi8>
%4 = vector.extract %a[0, 0, 2] : vector<4x2x8xi8>
[...]
%126 = vector.extract %a[3, 1, 7] : vector<4x2x8xi8>
%127 = vector.insert %126, %125 [63] : i8 into vector<64xi8>
%sv = vector.shape_cast %127 : vector<64xi8> to vector<[16]xi8>
Last lowering step + canonicalization (vector.insert/extract & vector.shape_cast → LLVM):
%0 = llvm.mlir.constant(63 : i64) : i64
%1 = llvm.mlir.constant(62 : i64) : i64
%2 = llvm.mlir.constant(61 : i64) : i64
[...]
%63 = llvm.mlir.constant(0 : i64) : i64
%cst = arith.constant dense<0> : vector<64xi8>
%64 = builtin.unrealized_conversion_cast %arg0 : vector<4x2x8xi8> to !llvm.array<4 x array<2 x vector<8xi8>>>
%65 = llvm.extractvalue %64[0, 0] : !llvm.array<4 x array<2 x vector<8xi8>>>
%66 = llvm.extractelement %65[%63 : i64] : vector<8xi8>
%67 = llvm.insertelement %66, %cst[%63 : i64] : vector<64xi8>
[...]
%253 = llvm.insertelement %252, %250[%1 : i64] : vector<64xi8>
%254 = llvm.extractvalue %64[3, 1] : !llvm.array<4 x array<2 x vector<8xi8>>>
%255 = llvm.extractelement %254[%56 : i64] : vector<8xi8>
%256 = llvm.insertelement %255, %253[%0 : i64] : vector<64xi8>
%tmp = arith.constant dense<0> : vector<[16]xi8>
%sv = llvm.intr.experimental.vector.insert %256, %tmp[0] : vector<64xi8> into vector<[16]xi8>
I’ve already submitted a patch that adds the experimental.vector.insert/extract intrinsics to the LLVM Dialect (D127100), and I’m working on another patch that will extend vector.shape_cast in the way I’ve described above.
I’ve already discussed this with several people, publicly and in private, and I’d like to hear opinions from the community.