Given that there’s growing interest in handling scalable vectorization in MLIR (see: [RFC] Add RISC-V Vector Extension (RVV) Dialect, and the older proposal: [RFC] Vector Dialects: Neon and SVE), I propose adding support for scalable vectors to the built-in vector type.
Motivation
The two main issues with the current approach used by ArmSVE and the proposed RVV are:
- Handling scalable vectors as low-level dialect types creates redundancy between the different hardware-specific dialects, and between those dialects and common ones. E.g., each hardware-specific dialect has to replicate Arith operations and their lowering to LLVM IR just to get them to accept scalable vector operands.
- Adapting vectorization and lowering passes to work with scalable vectors requires either making the common dialects depend on very low-level dialects, or duplicating those passes within each of the hardware-specific dialects.
These issues make working with scalable vectors extremely cumbersome, and they are bound to cause maintainability problems in the scalable-vector dialects, in higher-level dialects, or both. If MLIR is meant to work on all kinds of hardware, we need a way to indicate scalability in vectors across all dialects that work with them.
What is a scalable vector type?
A scalable vector type is a SIMD type that holds a number of elements that is a multiple of a fixed base size. The multiple is unknown at compile time, but it is a constant at runtime. E.g., something like:
%0 = arith.addf %a, %b : vector<4xf32>
indicates the addition of two vectors of 4 single precision floating point elements. On the other hand, if we represent a scalable vector using double angle brackets, something like:
%0 = arith.addf %a, %b : vector<<4xf32>>
indicates the addition of two vectors that contain a multiple of 4 single precision floating point elements. The multiple is a runtime constant, represented by the vector_scale operation, and its value is determined by the hardware implementation. (This mirrors LLVM IR, where the equivalent type is written `<vscale x 4 x float>` and the multiple is exposed through the llvm.vscale intrinsic.)
Proposed solution
As a first step, I suggest adding a flag within the VectorType class to represent scalability. This way, everything else keeps working as it is, and much of what is supported in the hardware-specific dialects for scalable vector types works automatically as part of common dialects like Vector (load/store) and Arith (arithmetic & comparison operations).
As the proposed syntax for scalable vectors I've chosen double angle brackets, mostly because it is friendlier to the MLIR parser than the syntax adopted by LLVM IR, but it also makes sense if you squint: in most cases, you can think of a scalable vector as an array of vectors, or a vector of vectors. That said, I will happily change it if there's a better alternative.
With the proposed change, assuming the input parameters are correctly sized, this is how a vector addition function implemented as a vector-length agnostic (VLA) loop would look like:
func @vector_add(%src_a: memref<?xf32>, %src_b: memref<?xf32>, %dst: memref<?xf32>, %size: index) {
  %c0 = arith.constant 0 : index
  %c4 = arith.constant 4 : index
  %vs = vector_scale : index
  %step = arith.muli %c4, %vs : index
  scf.for %i = %c0 to %size step %step {
    %0 = vector.load %src_a[%i] : memref<?xf32>, vector<<4xf32>>
    %1 = vector.load %src_b[%i] : memref<?xf32>, vector<<4xf32>>
    %2 = arith.addf %0, %1 : vector<<4xf32>>
    vector.store %2, %dst[%i] : memref<?xf32>, vector<<4xf32>>
  }
  return
}
And using the mlir-opt options -convert-vector-to-llvm -convert-scf-to-std -convert-std-to-llvm -reconcile-unrealized-casts, the following LLVM dialect code can be obtained:
llvm.func @vector_add(%arg0: !llvm.ptr<f32>, %arg1: !llvm.ptr<f32>, %arg2: i64, %arg3: i64, %arg4: i64, %arg5: !llvm.ptr<f32>, %arg6: !llvm.ptr<f32>, %arg7: i64, %arg8: i64, %arg9: i64, %arg10: !llvm.ptr<f32>, %arg11: !llvm.ptr<f32>, %arg12: i64, %arg13: i64, %arg14: i64, %arg15: i64) {
  %0 = llvm.mlir.undef : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %1 = llvm.insertvalue %arg0, %0[0] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %2 = llvm.insertvalue %arg1, %1[1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %3 = llvm.insertvalue %arg2, %2[2] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %4 = llvm.insertvalue %arg3, %3[3, 0] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %5 = llvm.insertvalue %arg4, %4[4, 0] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %6 = llvm.mlir.undef : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %7 = llvm.insertvalue %arg5, %6[0] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %8 = llvm.insertvalue %arg6, %7[1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %9 = llvm.insertvalue %arg7, %8[2] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %10 = llvm.insertvalue %arg8, %9[3, 0] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %11 = llvm.insertvalue %arg9, %10[4, 0] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %12 = llvm.mlir.undef : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %13 = llvm.insertvalue %arg10, %12[0] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %14 = llvm.insertvalue %arg11, %13[1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %15 = llvm.insertvalue %arg12, %14[2] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %16 = llvm.insertvalue %arg13, %15[3, 0] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %17 = llvm.insertvalue %arg14, %16[4, 0] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %18 = llvm.mlir.constant(0 : index) : i64
  %19 = llvm.mlir.constant(4 : index) : i64
  %20 = "llvm.intr.vscale"() : () -> i64
  %21 = llvm.mul %20, %19 : i64
  llvm.br ^bb1(%18 : i64)
^bb1(%22: i64):  // 2 preds: ^bb0, ^bb2
  %23 = llvm.icmp "slt" %22, %arg15 : i64
  llvm.cond_br %23, ^bb2, ^bb3
^bb2:  // pred: ^bb1
  %24 = llvm.extractvalue %5[1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %25 = llvm.getelementptr %24[%22] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
  %26 = llvm.bitcast %25 : !llvm.ptr<f32> to !llvm.ptr<vector<<4xf32>>>
  %27 = llvm.load %26 {alignment = 4 : i64} : !llvm.ptr<vector<<4xf32>>>
  %28 = llvm.extractvalue %11[1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %29 = llvm.getelementptr %28[%22] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
  %30 = llvm.bitcast %29 : !llvm.ptr<f32> to !llvm.ptr<vector<<4xf32>>>
  %31 = llvm.load %30 {alignment = 4 : i64} : !llvm.ptr<vector<<4xf32>>>
  %32 = llvm.fadd %27, %31 : vector<<4xf32>>
  %33 = llvm.extractvalue %17[1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %34 = llvm.getelementptr %33[%22] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
  %35 = llvm.bitcast %34 : !llvm.ptr<f32> to !llvm.ptr<vector<<4xf32>>>
  llvm.store %32, %35 {alignment = 4 : i64} : !llvm.ptr<vector<<4xf32>>>
  %36 = llvm.add %22, %21 : i64
  llvm.br ^bb1(%36 : i64)
^bb3:  // pred: ^bb1
  llvm.return
}
This, in turn, can be translated to LLVM IR (e.g., with mlir-translate -mlir-to-llvmir) that compiles to valid code for scalable vector architectures.
I’ve uploaded a patch with the proposed changes to provide some ground for the discussion:
[mlir][RFC] Make scalable vector type a built-in type
Open Issues
Better implementation of the scalable vector type.
In LLVM IR, the type is handled with a sensible class hierarchy, and I believe a similar implementation would also be preferable for MLIR. I've started with the flag-based approach to keep the size and reach of the patch contained: I expect this to be a longer discussion, and chasing VectorType uses throughout all the dialects over an extended period could be a significant amount of work. That said, once there's agreement on how to do this, I'm happy to skip this patch altogether and go straight for an alternative, if that is preferred.
Thank you in advance for your feedback on this topic.