[RFC] Vector Dialects: Neon and SVE

Authors: @nicolasvasilache (Google), @javiersetoain (Arm)

Hello everyone,

We have been discussing with Arm recently, and interest has grown in supporting the concrete mixed-precision and scalable operations provided by Neon and SVE. We have built some prototypes to connect the dots and propose to upstream these dialects to make compilation for mobile targets a reality in MLIR. Since these dialects are being developed by, and are of interest to, the same people, target similar hardware, and sit at similar levels of abstraction, they are proposed for inclusion in a joint RFC.

What is the overall goal of the dialect?

The Vector Dialect document discusses the MLIR vector abstractions and their tradeoffs. The Hardware Vector Ops (HWV) level is provisioned to allow representing non-portable operations and have them interoperate with portable vector operations and MLIR codegen. This proposal is for adding new Targets/Neon and Targets/SVE dialects that would directly model target-specific intrinsics.

This proposal enables three concrete things:

  1. Make it possible to represent mixed-precision and scalable workloads in MLIR, all the way to execution on specific HW.
  2. Connect MLIR codegen to these specific abstractions and explore transformations targeting mixed-precision and scalable workloads.
  3. Further explore tradeoffs and interop between HW-specific and HW-agnostic vector abstractions in MLIR.

What is the first implementation milestone?

The first implementation milestone adds the LLVM-level dialects and implements some basic operations (say *mull, *dot, *mmla). Like other intrinsics in the LLVM dialect, they are lightweight and are represented in their custom op form (i.e. no special parsing / printing behavior).

The Tablegen specification resembles:

def LLVM_aarch64_neon_smull :
  LLVMNeon_IntrBinaryOverloadedOp<"smull">, Arguments<(ins LLVM_Type, LLVM_Type)>;

and

def LLVM_aarch64_sve_smmla :
  LLVMSVE_IntrBinaryOverloadedOp<"smmla">,
  Arguments<(ins LLVM_Type, LLVM_Type, LLVM_Type)>;

def LLVM_aarch64_sve_sdot :
  LLVMSVE_IntrBinaryOverloadedOp<"sdot">,
  Arguments<(ins LLVM_Type, LLVM_Type, LLVM_Type)>;

The LLVM dialect form resembles:

llvm.func @neon_smull(%arg0: !llvm.vec<8 x i8>, %arg1: !llvm.vec<8 x i8>) -> !llvm.vec<8 x i16> {
  %0 = "llvm_neon.smull"(%arg0, %arg1) : (!llvm.vec<8 x i8>, !llvm.vec<8 x i8>) -> !llvm.vec<8 x i16>
  llvm.return %0 : !llvm.vec<8 x i16>
}

and

llvm.func @sve_sdot(%arg0: !llvm.vec<? x 16 x i8>, %arg1: !llvm.vec<? x 16 x i8>, %arg2: !llvm.vec<? x 4 x i32>) -> !llvm.vec<? x 4 x i32> {
  %0 = "llvm_sve.sdot"(%arg2, %arg0, %arg1) : (!llvm.vec<? x 4 x i32>, !llvm.vec<? x 16 x i8>, !llvm.vec<? x 16 x i8>) -> !llvm.vec<? x 4 x i32>
  llvm.return %0 : !llvm.vec<? x 4 x i32>
}
llvm.func @sve_add_2(%arg0: !llvm.vec<? x 16 x i8>, %arg1: !llvm.vec<? x 16 x i8>, %arg2: !llvm.vec<? x 4 x i32>) -> !llvm.vec<? x 4 x i32> {
  %0 = "llvm_sve.smmla"(%arg2, %arg0, %arg1) : (!llvm.vec<? x 4 x i32>, !llvm.vec<? x 16 x i8>, !llvm.vec<? x 16 x i8>) -> !llvm.vec<? x 4 x i32>
  llvm.return %0 : !llvm.vec<? x 4 x i32>
}

They have counterpart operations specified on MLIR 1-D vector types for the purposes of type checking, interop with portable vector ops, and codegen with progressive lowering. The Tablegen specification is:

def SMullOp : Neon_Op<"smull", [NoSideEffect,
  AllTypesMatch<["a", "b"]>,
  TypesMatchWith<
    "res has same vector shape and element bitwidth scaled by 2 as a",
    "a", "res", "$_self.cast<VectorType>().scaleElementBitwidth(2)">]> {
  let summary = "smull op";
  let description = [{
    /* Doc to extract from Neon ISA manual */
  }];
  // Supports either:
  //   (vector<8xi8>, vector<8xi8>) -> (vector<8xi16>)
  //   (vector<4xi16>, vector<4xi16>) -> (vector<4xi32>)
  //   (vector<2xi32>, vector<2xi32>) -> (vector<2xi64>)
  let arguments = (ins VectorOfLengthAndType<[8, 4, 2], [I8, I16, I32]>:$a,
                       VectorOfLengthAndType<[8, 4, 2], [I8, I16, I32]>:$b);
  let results = (outs VectorOfLengthAndType<[8, 4, 2], [I16, I32, I64]>:$res);
  let assemblyFormat =
    "$a `,` $b attr-dict `:` type($a) `to` type($res)";
}

def SmmlaOp : SVE_Op<"smmla",
                [NoSideEffect,
                AllTypesMatch<["src1", "src2"]>,
                AllTypesMatch<["acc", "dst"]>,
              ]> {
  let summary = "Matrix-matrix multiply and accumulate op";
  let description = [{
    The smmla op is an SVE-specific op that lowers to the corresponding LLVM
    intrinsic, `llvm.aarch64.sve.smmla`.
    /* Doc to extract from SVE ISA manual */
  }];
  // Inputs are scalable vectors of i8; the accumulator and result are
  // scalable vectors of i32.
  let arguments = (ins
          ScalableVectorOf<[I32]>:$acc,
          ScalableVectorOf<[I8]>:$src1,
          ScalableVectorOf<[I8]>:$src2
  );
  let results = (outs ScalableVectorOf<[I32]>:$dst);
  let assemblyFormat =
    "$acc `,` $src1 `,` $src2 attr-dict `:` type($src1) `->` type($dst)";
}

The MLIR form for Neon ops composes with existing retargetable vector ops and resembles:

func @neon_smull(%a: vector<8xi8>, %b: vector<8xi8>)
    -> (vector<8xi16>, vector<4xi32>, vector<2xi64>) {

  %0 = neon.smull %a, %b : vector<8xi8> to vector<8xi16>
  %00 = vector.extract_strided_slice %0 {offsets = [3], sizes = [4], strides = [1]}:
    vector<8xi16> to vector<4xi16>

  %1 = neon.smull %00, %00 : vector<4xi16> to vector<4xi32>
  %11 = vector.extract_strided_slice %1 {offsets = [1], sizes = [2], strides = [1]}:
    vector<4xi32> to vector<2xi32>

  %2 = neon.smull %11, %11 : vector<2xi32> to vector<2xi64>

  return %0, %1, %2 : vector<8xi16>, vector<4xi32>, vector<2xi64>
}

The SVE dialect, on the other hand, defines its own scalable vector type and does not yet compose with existing retargetable vector ops. It resembles:

func @sve_smmla(%a: !sve.vector<16xi8>, %b: !sve.vector<16xi8>, %c: !sve.vector<4xi32>) -> !sve.vector<4xi32>
{
  %0 = sve.smmla %c, %a, %b : !sve.vector<16xi8> -> !sve.vector<4xi32>
  return %0 : !sve.vector<4xi32>
}

This is a good starting point to support mixed target-agnostic and target-specific lowering.

Scalable Vector Type Representation

Scalable vectors mirror the syntax of standard vectors. If

vector<4xf32>

is a fixed-length vector containing 4 single precision floating point elements,

!sve.vector<4xf32>

is a scalable vector containing a HW-dependent symbolic multiple of 4 single-precision floating point elements. For instance, on a 512-bit SVE implementation the multiple (vscale) is 4, so the vector holds 16 elements.

A scalable vector like the one above converts to a scalable vector in the LLVM Dialect:

!llvm.vec<? x 4 x f32>

which in turn translates to LLVM IR:

<vscale x 4 x float>

Connection to control flow (e.g. loops) and memory operations follows VLA-style programming, which is akin to parametric tiling. The SVE dialect introduces an op to represent the scaling factor, lowering to the corresponding LLVM intrinsic:

sve.vscale : index
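To illustrate the VLA style, a loop could step by a multiple of `sve.vscale` so that it processes one scalable vector per iteration regardless of the HW vector length. This is a hypothetical sketch (the memory and compute ops for scalable vectors are not yet defined, so the body is elided):

```mlir
// Sketch only: process %n f32 elements in steps of (vscale * 4),
// i.e. one !sve.vector<4xf32> worth of data per iteration.
func @vla_loop(%n: index) {
  %c0 = constant 0 : index
  %c4 = constant 4 : index
  %vs = sve.vscale : index
  %step = muli %vs, %c4 : index    // elements per scalable vector
  scf.for %i = %c0 to %n step %step {
    // loads / computation / stores on !sve.vector<4xf32> would go here
  }
  return
}
```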

The best way to lower from Vector Dialect to Scalable Vector-based code remains an open question. We expect that the availability of these lower-level vector dialects will help experimentation to determine the best way to lower from high-level fixed-length vector code down to low-level scalable vector code.

In this first incarnation, the scalable vector type is confined to the SVE dialect, which encompasses Arm-specific instructions. Once we gain more experience with end-to-end codegen of scalable vectors, we expect to separate out the type representation so that the infrastructure becomes generally reusable across HW.

How does it fit into the MLIR dialect ecosystem?

Connection: how does it connect to the existing dialects in a compilation pipeline(s)?

The Neon and SVE dialects sit at the HWV layer in the following diagram (extracted from the Vector Dialect document):

The compilation pipeline will start by allowing naive codegen of higher-level ops (e.g. Linalg, loops) that carry the payload information of these new ops.

More elaborate compilation, involving notably scalar-to-vector conversions in the presence of these new ops and scalable vector types, is the subject of ongoing investigations.

Consolidation: is there already a dialect with a similar goal or matching abstractions; if so, can it be improved instead of adding a new one?

There is no current support for ops with mixed-precision and ops on scalable vectors in MLIR core. Additionally, the design of MLIR vector abstractions provisions for target-specific dialects to capture HW-specific variations.

Reuse: how does it generalize to similar but slightly different use-cases?

In the future, we expect the design of the vector dialect and transformations to be influenced by the Neon and SVE dialects and evolve towards more generality than what is allowed today.

Still, even with future evolutions of the vector dialect, target-specific abstractions that allow finer-grained control than what can be achieved with action-at-a-distance through compiler flags, will be useful. High-performance libraries are expected to be simpler to design and interoperate when the proper abstractions are available at the right level in the IR.

Who are the future contributors/maintainers beyond those who propose the dialect?

It is expected that these dialects will be a generally useful abstraction layer to the MLIR community. While it is the goal that the community itself will contribute to extending and maintaining the abstractions, for the foreseeable future we expect Google and Arm to contribute and maintain these dialects.

Current Status

Prototypes for the Neon dialect and the SVE dialect are available and proposed for upstreaming.
They already allow the generation of the expected assembly.

Neon

From the MLIR input:

func @neon_smull(%a: vector<8xi8>, %b: vector<8xi8>)
    -> (vector<8xi16>, vector<4xi32>, vector<2xi64>) {
  %0 = neon.smull %a, %b : vector<8xi8> to vector<8xi16>
  %00 = vector.extract_strided_slice %0 {offsets = [3], sizes = [4], strides = [1]}:
    vector<8xi16> to vector<4xi16>

  %1 = neon.smull %00, %00 : vector<4xi16> to vector<4xi32>
  %11 = vector.extract_strided_slice %1 {offsets = [1], sizes = [2], strides = [1]}:
    vector<4xi32> to vector<2xi32>

  %2 = neon.smull %11, %11 : vector<2xi32> to vector<2xi64>

  return %0, %1, %2 : vector<8xi16>, vector<4xi32>, vector<2xi64>
}

Convert to LLVM Dialect with mlir-opt -convert-neon-to-llvm:

module  {
  llvm.func @neon_smull(%arg0: !llvm.vec<8 x i8>, %arg1: !llvm.vec<8 x i8>) -> !llvm.struct<(vec<8 x i16>, vec<4 x i32>, vec<2 x i64>)> {
    %0 = "llvm_neon.smull"(%arg0, %arg1) : (!llvm.vec<8 x i8>, !llvm.vec<8 x i8>) -> !llvm.vec<8 x i16>
    %1 = llvm.shufflevector %0, %0 [3, 4, 5, 6] : !llvm.vec<8 x i16>, !llvm.vec<8 x i16>
    %2 = "llvm_neon.smull"(%1, %1) : (!llvm.vec<4 x i16>, !llvm.vec<4 x i16>) -> !llvm.vec<4 x i32>
    %3 = llvm.shufflevector %2, %2 [1, 2] : !llvm.vec<4 x i32>, !llvm.vec<4 x i32>
    %4 = "llvm_neon.smull"(%3, %3) : (!llvm.vec<2 x i32>, !llvm.vec<2 x i32>) -> !llvm.vec<2 x i64>
    %5 = llvm.mlir.undef : !llvm.struct<(vec<8 x i16>, vec<4 x i32>, vec<2 x i64>)>
    %6 = llvm.insertvalue %0, %5[0] : !llvm.struct<(vec<8 x i16>, vec<4 x i32>, vec<2 x i64>)>
    %7 = llvm.insertvalue %2, %6[1] : !llvm.struct<(vec<8 x i16>, vec<4 x i32>, vec<2 x i64>)>
    %8 = llvm.insertvalue %4, %7[2] : !llvm.struct<(vec<8 x i16>, vec<4 x i32>, vec<2 x i64>)>
    llvm.return %8 : !llvm.struct<(vec<8 x i16>, vec<4 x i32>, vec<2 x i64>)>
  }
}

Translate to LLVM IR with mlir-translate -neon-mlir-to-llvmir:

define { <8 x i16>, <4 x i32>, <2 x i64> } @neon_smull(<8 x i8> %0, <8 x i8> %1) !dbg !3 {
  %3 = call <8 x i16> @llvm.aarch64.neon.smull.v8i16(<8 x i8> %0, <8 x i8> %1), !dbg !7
  %4 = shufflevector <8 x i16> %3, <8 x i16> %3, <4 x i32> <i32 3, i32 4, i32 5, i32 6>, !dbg !9
  %5 = call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %4, <4 x i16> %4), !dbg !10
  %6 = shufflevector <4 x i32> %5, <4 x i32> %5, <2 x i32> <i32 1, i32 2>, !dbg !11
  %7 = call <2 x i64> @llvm.aarch64.neon.smull.v2i64(<2 x i32> %6, <2 x i32> %6), !dbg !12
  %8 = insertvalue { <8 x i16>, <4 x i32>, <2 x i64> } undef, <8 x i16> %3, 0, !dbg !13
  %9 = insertvalue { <8 x i16>, <4 x i32>, <2 x i64> } %8, <4 x i32> %5, 1, !dbg !14
  %10 = insertvalue { <8 x i16>, <4 x i32>, <2 x i64> } %9, <2 x i64> %7, 2, !dbg !15
  ret { <8 x i16>, <4 x i32>, <2 x i64> } %10, !dbg !16
}

; Function Attrs: nounwind readnone
declare <8 x i16> @llvm.aarch64.neon.smull.v8i16(<8 x i8>, <8 x i8>) #0

; Function Attrs: nounwind readnone
declare <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16>, <4 x i16>) #0

; Function Attrs: nounwind readnone
declare <2 x i64> @llvm.aarch64.neon.smull.v2i64(<2 x i32>, <2 x i32>) #0

Compile to AArch64 assembly with llc -O3 -mtriple=aarch64-none-linux-gnu -mattr=+neon:

neon_smull:                             // @neon_smull
.Lfunc_begin0:
        .file   1 "/usr/local/google/home/ntv/github/llvm-project/build/<stdin>"
        .loc    1 2 0                           // <stdin>:2:0
        .cfi_startproc
// %bb.0:
        .loc    1 3 10 prologue_end             // <stdin>:3:10
        smull   v0.8h, v0.8b, v1.8b
        .loc    1 4 10                          // <stdin>:4:10
        ext     v1.16b, v0.16b, v0.16b, #6
        .loc    1 5 10                          // <stdin>:5:10
        smull   v1.4s, v1.4h, v1.4h
        .loc    1 6 10                          // <stdin>:6:10
        ext     v2.16b, v1.16b, v1.16b, #4
        .loc    1 7 10                          // <stdin>:7:10
        smull   v2.2d, v2.2s, v2.2s
        .loc    1 12 5                          // <stdin>:12:5
        ret

SVE

From the MLIR input:

func @sve_sdot(%a: !sve.vector<16xi8>, %b: !sve.vector<16xi8>, %c: !sve.vector<4xi32>) -> !sve.vector<4xi32>
{
  %0 = sve.sdot %c, %a, %b : !sve.vector<16xi8> -> !sve.vector<4xi32>
  return %0 : !sve.vector<4xi32>
}

func @sve_smmla(%a: !sve.vector<16xi8>, %b: !sve.vector<16xi8>, %c: !sve.vector<4xi32>) -> !sve.vector<4xi32>
{
  %0 = sve.smmla %c, %a, %b : !sve.vector<16xi8> -> !sve.vector<4xi32>
  return %0 : !sve.vector<4xi32>
}

func @sve_udot(%a: !sve.vector<16xui8>, %b: !sve.vector<16xui8>, %c: !sve.vector<4xui32>) -> !sve.vector<4xui32>
{
  %0 = sve.udot %c, %a, %b : !sve.vector<16xui8> -> !sve.vector<4xui32>
  return %0 : !sve.vector<4xui32>
}

func @sve_ummla(%a: !sve.vector<16xui8>, %b: !sve.vector<16xui8>, %c: !sve.vector<4xui32>) -> !sve.vector<4xui32>
{
  %0 = sve.ummla %c, %a, %b : !sve.vector<16xui8> -> !sve.vector<4xui32>
  return %0 : !sve.vector<4xui32>
}

func @get_vscale() -> index
{
  %0 = sve.vscale : index
  return %0 : index
}

Convert to LLVM Dialect with mlir-opt -convert-sve-to-llvm:

module  {
  llvm.func @sve_sdot(%arg0: !llvm.vec<? x 16 x i8>, %arg1: !llvm.vec<? x 16 x i8>, %arg2: !llvm.vec<? x 4 x i32>) -> !llvm.vec<? x 4 x i32> {
    %0 = "llvm_sve.sdot"(%arg2, %arg0, %arg1) : (!llvm.vec<? x 4 x i32>, !llvm.vec<? x 16 x i8>, !llvm.vec<? x 16 x i8>) -> !llvm.vec<? x 4 x i32>
    llvm.return %0 : !llvm.vec<? x 4 x i32>
  }
  llvm.func @sve_smmla(%arg0: !llvm.vec<? x 16 x i8>, %arg1: !llvm.vec<? x 16 x i8>, %arg2: !llvm.vec<? x 4 x i32>) -> !llvm.vec<? x 4 x i32> {
    %0 = "llvm_sve.smmla"(%arg2, %arg0, %arg1) : (!llvm.vec<? x 4 x i32>, !llvm.vec<? x 16 x i8>, !llvm.vec<? x 16 x i8>) -> !llvm.vec<? x 4 x i32>
    llvm.return %0 : !llvm.vec<? x 4 x i32>
  }
  llvm.func @sve_udot(%arg0: !llvm.vec<? x 16 x i8>, %arg1: !llvm.vec<? x 16 x i8>, %arg2: !llvm.vec<? x 4 x i32>) -> !llvm.vec<? x 4 x i32> {
    %0 = "llvm_sve.udot"(%arg2, %arg0, %arg1) : (!llvm.vec<? x 4 x i32>, !llvm.vec<? x 16 x i8>, !llvm.vec<? x 16 x i8>) -> !llvm.vec<? x 4 x i32>
    llvm.return %0 : !llvm.vec<? x 4 x i32>
  }
  llvm.func @sve_ummla(%arg0: !llvm.vec<? x 16 x i8>, %arg1: !llvm.vec<? x 16 x i8>, %arg2: !llvm.vec<? x 4 x i32>) -> !llvm.vec<? x 4 x i32> {
    %0 = "llvm_sve.ummla"(%arg2, %arg0, %arg1) : (!llvm.vec<? x 4 x i32>, !llvm.vec<? x 16 x i8>, !llvm.vec<? x 16 x i8>) -> !llvm.vec<? x 4 x i32>
    llvm.return %0 : !llvm.vec<? x 4 x i32>
  }
  llvm.func @get_vscale() -> !llvm.i64 {
    %0 = "llvm_sve.vscale"() : () -> !llvm.i64
    llvm.return %0 : !llvm.i64
  }
}

Translate to LLVM IR with mlir-translate -sve-mlir-to-llvmir:

define <vscale x 4 x i32> @sve_sdot(<vscale x 16 x i8> %0, <vscale x 16 x i8> %1, <vscale x 4 x i32> %2) !dbg !3 {
  %4 = call <vscale x 4 x i32> @llvm.aarch64.sve.sdot.nxv4i32(<vscale x 4 x i32> %2, <vscale x 16 x i8> %0, <vscale x 16 x i8> %1), !dbg !7
  ret <vscale x 4 x i32> %4, !dbg !9
} 

define <vscale x 4 x i32> @sve_smmla(<vscale x 16 x i8> %0, <vscale x 16 x i8> %1, <vscale x 4 x i32> %2) !dbg !10 {
  %4 = call <vscale x 4 x i32> @llvm.aarch64.sve.smmla.nxv4i32(<vscale x 4 x i32> %2, <vscale x 16 x i8> %0, <vscale x 16 x i8> %1), !dbg !11
  ret <vscale x 4 x i32> %4, !dbg !13
} 

define <vscale x 4 x i32> @sve_udot(<vscale x 16 x i8> %0, <vscale x 16 x i8> %1, <vscale x 4 x i32> %2) !dbg !14 {
  %4 = call <vscale x 4 x i32> @llvm.aarch64.sve.udot.nxv4i32(<vscale x 4 x i32> %2, <vscale x 16 x i8> %0, <vscale x 16 x i8> %1), !dbg !15
  ret <vscale x 4 x i32> %4, !dbg !17
} 

define <vscale x 4 x i32> @sve_ummla(<vscale x 16 x i8> %0, <vscale x 16 x i8> %1, <vscale x 4 x i32> %2) !dbg !18 {
  %4 = call <vscale x 4 x i32> @llvm.aarch64.sve.ummla.nxv4i32(<vscale x 4 x i32> %2, <vscale x 16 x i8> %0, <vscale x 16 x i8> %1), !dbg !19
  ret <vscale x 4 x i32> %4, !dbg !21
} 

define i64 @get_vscale() !dbg !22 {
  %1 = call i64 @llvm.vscale.i64(), !dbg !23
  ret i64 %1, !dbg !25
} 

declare <vscale x 4 x i32> @llvm.aarch64.sve.sdot.nxv4i32(<vscale x 4 x i32>, <vscale x 16 x i8>, <vscale x 16 x i8>) #0
declare <vscale x 4 x i32> @llvm.aarch64.sve.smmla.nxv4i32(<vscale x 4 x i32>, <vscale x 16 x i8>, <vscale x 16 x i8>) #0
declare <vscale x 4 x i32> @llvm.aarch64.sve.udot.nxv4i32(<vscale x 4 x i32>, <vscale x 16 x i8>, <vscale x 16 x i8>) #0
declare <vscale x 4 x i32> @llvm.aarch64.sve.ummla.nxv4i32(<vscale x 4 x i32>, <vscale x 16 x i8>, <vscale x 16 x i8>) #0
declare i64 @llvm.vscale.i64() #1

attributes #0 = { nounwind readnone }
attributes #1 = { nofree nosync nounwind readnone willreturn }

Compile to AArch64 assembly:

    // [...]
sve_sdot:                               // @sve_sdot
    sdot    z2.s, z0.b, z1.b
    mov z0.d, z2.d
    ret
    // [...]
sve_smmla:                              // @sve_smmla
    smmla   z2.s, z0.b, z1.b
    mov z0.d, z2.d
    ret
    // [...]
sve_udot:                               // @sve_udot
    udot    z2.s, z0.b, z1.b
    mov z0.d, z2.d
    ret
    // [...]
sve_ummla:                              // @sve_ummla
    ummla   z2.s, z0.b, z1.b
    mov z0.d, z2.d
    ret
    // [...]
get_vscale:                             // @get_vscale
    rdvl    x8, #1
    lsr x0, x8, #4
    ret
    // [...]

One way to compile the generated LLVM IR to AArch64 could be:

 llc -march=aarch64 -mattr=v8.6a,sve

In this case, the minimum requirement for the xMMLA instructions is the v8.6a and sve attribute flags.


This is really nice! I’m curious if we think that there is a VLA abstraction we can extract that would work across vendors?

Thanks for the proposal! I have a few general comments/questions:

Naming

These dialects look very Arm-specific but have very general names. Can we add an arm_ or aarch64_ prefix to the dialects to prevent confusion about what they are intended to be used for, and to avoid name conflicts with things that are general? I hope that we can establish a convention and do this for all target-specific dialects. I would also prefer that we have a subdirectory for dialects containing abstractions for a particular target.

Direction / Purpose

Neon

I understand the desire to have a dialect for specific targets that supports higher-level type systems, but a general concern I have is: are we going to end up with N copies of these dialects for the N different type systems that want to interface with them? If so, how do we intend to cope with this? If not, how can we construct a rationale to prevent it? As we add more target dialects, do we have constraints/rationale on what abstraction level they should target? LLVM has the benefit of having a single type system, so target intrinsics have to fit within that, but that isn’t necessarily enforced in MLIR. If we want these target dialects to compose at the same abstraction level, we need to explicitly codify it.

The above also goes a bit into how reusable this dialect (and those of a similar nature) is intended to be. For example, could FIR or a hypothetical ClangIR dialect make use of this dialect as part of its compilation flow? The diagram you provide for how this fits into the ecosystem only mentions compiling from a high-level tensor language, not other domains or programming languages. I would prefer to also see how this seemingly general dialect intends to fit outside of the HLO-esque domain. The MLIR ecosystem encompasses much more than what is shown in the diagram.

SVE

This currently has a very general name, but seems (from the prototype) hard-coded to Arm intrinsics in LLVM. Do we foresee this generalizing into something not Arm-specific, i.e. something that can support VLA on other architectures?

– River

Thank you for the comments! :slight_smile: I’ll give you my take on this, and I’ll let Nicolas add or clarify if I missed something:

I believe there is, but I’m very familiar with SVE and barely familiar at all with RISC-V vector instructions. I think the vector type, the vscale primitive, and any basic transformations from fixed-length to scalable length will be architecture-agnostic. I think those should eventually live in Vector, conceptually that’s what Vector is there for. Things like predicates (generation & execution), first-faulting and non-faulting memory accesses… that could be a bit trickier, it will require some careful thinking (and maybe compromise).

That makes a lot of sense. If the convention doesn’t exist, should we just come up with something now? Open another RFC? Do we move forward as-is and fix it later, or do we wait until the convention is clear before proceeding? I’m partial to moving forward and fixing it later, or deciding on a convention now and fixing it now :slight_smile:

The way I see it, the purpose of these low-level dialects is to be able to target specific hardware features from within MLIR. You are likely to have more semantic information within MLIR than in LLVM IR, and that opens new opportunities for effective vectorization. The alternative is leaving everything in general vector form and letting LLVM figure out how best to use the available hardware features. If we want to achieve performance levels on par with hand-optimized code, we may want to take advantage of the extra domain knowledge provided by MLIR.

I don’t see why not. A compiler targeting a processor with features exposed by a hw-specific dialect could either target those features through that dialect, or ignore them and let LLVM take care of it. But maybe there’s something I’m missing/not seeing?

SVE sounds non-specific (Scalable Vector Extension), but it’s the proper name given to the ISA extension by Arm. This doesn’t preclude your point about having a better naming convention; you do raise a good point, I just wanted to clarify :slight_smile: As for the question, I do foresee the generalization of some part of this dialect into Vector. What part, and how, is not 100% clear (to me) yet. The two problems I have are:

  1. I can’t generate scalable vector code from MLIR as it is right now
  2. I don’t know what the best way to go from fixed-length to VLA code is

In order to try different things, find gaps, and come up with the best, most general possible solution to 2, we need at least 1. For that, we can either play with the design of Vector, hammer things in, chisel things out, until we find the best way to go about it; or we can do all of that within this target-specific dialect and move things up the abstraction ladder as they become clearer. Ideally, we’re eventually left with a Vector dialect that supports VLA vectorization, and a target-specific dialect that exposes features that are too SVE-specific to include in a higher-level common dialect.

Let me clarify, my top priority is to be able to target scalable vector length architectures. After some discussion, we thought that the best way was to start from a specific scalable vector architecture and work our way up. With this approach we tackle two issues at once:

  1. How to do VLA in MLIR
  2. How to generate efficient code for SVE-based processors

Does this make sense?

Cheers!
Javier

Thanks Javier, the explanations help a lot!

I think we can decide on a convention now, given that we have a driving use case.

The point I was trying to make is that being “in MLIR” is not a well-defined concept from the context of being at a particular abstraction level within a compilation flow. Put a different way: should we try to ensure that all “target”-esque dialects operate at the same level of abstraction (the HWV layer, in this case) within MLIR?

This is why I’m generally not on board with explicitly hard-coding an operation’s semantics to a specific LLVM intrinsic/instruction/etc. When you do this, you bake in specific design choices and restrictions that are often specific to that IR, which prevents natural evolution within MLIR. Taking advantage of the information within MLIR can translate to different things, and can affect how we may want to represent a particular operation or metadata. I would hate to end up in a world where the evolution of MLIR is blocked by the state of another IR.

This point corresponds to the abstraction level within MLIR that we choose for these target dialects. It’s important to remember that an “MLIR compiler” encompasses many different abstraction levels, so the point at which a compiler targets something can vary depending on the compiler. The compilation flow of MLIR is not exactly 1-1 with how traditional LLVM compilers operate.

Yes, I am familiar with what it means in this context. The point that was sticking for me is that this is specific to Arm, and the fact that MLIR encompasses many different domains often means that acronyms overlap more frequently than you would expect.

This is a totally fine approach, and I’m not against it in the slightest. My main concerns were built around my confusion about the general name given to the dialect while seeing specific mentions of Arm. That is all cleared up now.

Thanks again Javier, for the very detailed responses. FWIW, I’m +1 on the proposal in general. I’m very excited about this work moving forward.

– River

Thanks for the comments, here are a few cents from me :slight_smile:

+1 that’s what we want to build towards.

How about we isolate these new target-specific dialects under Dialect/Target/ArmNeon and Dialect/Target/ArmSve (modulo the capitalization and numbering (32/64/v7/8/9?) you prefer) for now and refine later?

I do not see evidence that N > 1 and I would see it as a big red flag which would require a lot of thought if someone proposed that. The evolution we are targeting is that the scalable vector type will be moved to vector once transformations are understood.

Yes, absolutely: vector abstractions are “payload ops” and can compose with “control” ops. This is why the tensor-language parts are “ghosty” in the diagram. This is also described in the vector dialect doc: “The following diagram seeks to isolate vector dialects from the complexity of the codegen paths and focus on the payload-carrying ops that operate on std and vector types. This diagram is not to be taken as set in stone and representative of what exists today but rather illustrates the layering of abstractions in MLIR.”

AFAICS we want all target dialects to operate on core types, with a little flexibility to allow partial compositions. Now, for SVE there is a new type that we are not yet ready to propose for core. Even if we did, and the type was in core, things would still not compose until we have the transformations to connect scalars and fixed-size vectors to scalable vectors. Still, even in the absence of that, control-flow ops would compose with these vector ops. We would not be able to go through memory with them yet, so it is likely that we will push interop through a mechanical fixed-size-vector -> scalable-vector conversion for now (with a runtime assert that the vector size is a multiple of vscale). Once these are sorted out, it will be time to move the scalable type to core and make it work with memref and buffers.

Now, I agree that this may be overkill, as we just want a bit on the vector type: an alternative is to just go for it in core now and fail to legalize anything that is a scalable vector for now. I would prefer we not start mixing things too early, though, because we are not sure this is the right model for MLIR.

I think, generally, the red flags we want to avoid are designing new load/store ops and buffer types. This is where keeping things opaque in MLIR and resolving them at the LLVM level has worked well to connect the dots without building abstractions that will never compose. A concrete example: we will likely have an op to convert a vector value to a scalable vector value and back; this will likely go through an LLVM bitcast and be resolved late.
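Such a conversion pair might look something like the following sketch. The op names here are purely hypothetical, chosen only to illustrate the shape of the interop; the actual ops, if added, could be named and specified differently:

```mlir
// Hypothetical ops: cast a fixed-size vector to a scalable vector and back.
// A runtime assert would check that 16 is a multiple of (vscale * 4);
// the lowering would resolve to an LLVM bitcast late in the pipeline.
%s = sve.cast_to_scalable %v : vector<16xf32> to !sve.vector<4xf32>
// ... computation on the scalable value ...
%f = sve.cast_from_scalable %r : !sve.vector<4xf32> to vector<16xf32>
```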

Yes, the “vscale” op will likely become standard when the scalable type becomes builtin (?). But we likely want to have some signal on the transformations side that things are reasonable first.

I think for “sequential payload carrying ops” it is relatively easy to say they should be as close as possible to the core/builtin types, avoiding the introduction of load/store/buffers and instead hiding the abstraction switch at the LLVM level to optimize later. For other things, e.g. Tesseract (a PIM-enabled parallel graph processing architecture based on Micron’s Hybrid Memory Cube (HMC)), I don’t know yet, but we can think of gradually relaxing things as it makes sense.

I believe some of this is related to Exposing LLVM ops to the MLIR type system: I would love a simpler way to make LLVM ops that we don’t intend to generalize live in a layer where they could take arbitrary MLIR or LLVM types. The decision to go to a particular target-specific op likely needs to start as a target-specific rewrite. As we understand common properties across different targets, we can refactor the logic into interfaces and decisions, which seems aligned with the evolution of MLIR. I imagine in the future there will be a need to evolve interfaces to account for heuristics/cost models etc. There is also the greedy vs. dynamic-programming vs. other pattern-selection question lurking around there.

Does “Sequential payload carrying ops on core types” + ability to introduce types resolved at the LLVM level + restriction on load/store/alloc, as outlined above, seem like a reasonable first cut? This seems like the minimal incremental step for SVE.

Another point to raise here is the one about signedness.
The examples above use signed ints but they could have also used signless ints.

It seems there are a few tradeoffs here:

  1. If we manage to avoid duplicating the ops, by having LLVM instructions that can also take standard MLIR types, then the ops already have signed and unsigned versions that can work on signless integers.
  2. Alternatively, if we don’t avoid duplicating the ops, it seems undesirable to duplicate each version of the LLVM ops: we could use signed operands + lowering based on operand types.

Irrespective of that tradeoff, when trying to connect pieces, I did not find a way to convert “with”-sign to signless and back. In turn, this makes it hard to perform basic operations such as sext/zext/trunc when operating on “with”-sign types. This led to https://reviews.llvm.org/D92234, which apparently goes in the wrong direction. What is the proper way to make “with”-sign types connect to sext/zext/trunc?
Looking around, I do not see much use of “with”-sign types in core; has this been added prematurely?

Do any existing design decisions change in light of the mixed-precision use case?

I wondered about this choice when reading the RFC. I believe that LLVM is quite bought into the “ops carry any necessary signedness info” point of view. AFAIK, the sign-carrying types largely exist for interop with layers of the system that take a different opinion, and they are useful at the very high level (think: a language-representation level where signedness is carried by the type). There is very little support for them in core, and you’ll note that the standard dialect cannot even represent a signed/unsigned constant (by design).

I’d think you want to follow LLVM norms here and go completely signless.

I agree.

If anyone is curious about ancient history, this talk explains some of the reasons that LLVM IR switched from signed to signless types. I think it makes sense to keep these “slightly higher level than LLVM” dialects to signless types.

The signed types were intended for the element type of tensors when working with frameworks like TF, and for source-language dialects. Mid-level IR applications should stick with signless, IMO.

-Chris

Can we autogenerate these dialects based on the LLVM intrinsic definitions or clang’s intrinsic definitions?

It seems like a couple hundred lines of code would be enough to generate a fairly complete interop layer for the LLVM-mirroring part (we do something similar in npcomp for the aten dialect). That seems relatively uncontroversial, modulo naming. If the autogeneration approach is achievable (and I think it is), then I think we should definitely approach it that way.
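As a rough illustration of how small such a generator could be, here is a hedged Python sketch. The emitted TableGen mirrors the `LLVM_aarch64_neon_smull` definition from the RFC; the intrinsic list, the idea of reading it from `IntrinsicsAArch64.td`, and the `emit_neon_ops` helper are all assumptions for illustration, not the actual npcomp or MLIR tooling:

```python
# Emit TableGen op definitions mirroring a list of binary LLVM Neon
# intrinsics. A real generator would scrape the names from LLVM's
# intrinsic definitions rather than taking a hand-written list.

TEMPLATE = """\
def LLVM_aarch64_neon_{name} :
  LLVMNeon_IntrBinaryOverloadedOp<"{name}">,
  Arguments<(ins LLVM_Type, LLVM_Type)>;
"""

def emit_neon_ops(intrinsic_names):
    """Return TableGen text defining one op per binary Neon intrinsic."""
    return "\n".join(TEMPLATE.format(name=n) for n in intrinsic_names)

if __name__ == "__main__":
    print(emit_neon_ops(["smull", "umull", "sdot", "udot"]))
```

The same template-per-op-class approach extends to unary and ternary intrinsics by adding one template per arity.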

If we peel the LLVM-mirroring part off from the proposal, what’s left? From my reading of the RFC, the main point of contention (ignoring naming / directory layout) is the “slightly higher than LLVM interop” layer that just mirrors the autogenerated LLVM interop layers into something slightly more idiomatic for MLIR. (It definitely seems like a “trivially translatable” subset / op interface, as discussed in Exposing LLVM ops to the MLIR type system, could remove the need to have two nearly identical dialects, each eventually with hundreds/thousands of ops.)

The LLVM part is easy; the MLIR-facing part is much less so, I think (see this post).

Unfortunately, I do not see an easy way for ops that are not trivially 1-1 (see post).
The vector ops want OpInterfaces to talk to vector transformations, as well as nice MLIR type error messages rather than just breaking at runtime when trying to construct the LLVM type.

My bad, flipped the bit in the wrong direction when unifying SVE and Neon text for the proposal. Reverting.

Thanks, after looking a bit deeper, this seems like one of the classical tradeoffs between passing static information via a) types, b) attributes, c) op semantics, or d) structure in the IR. In this case, the problem seems to clearly favor op semantics. I imagine it involved some considerations at function boundaries, but signed/unsigned/signless have the same size, so maybe those considerations reduced to nothing.

Hello guys,

I have some questions regarding your effort going forward:

  • How can XLA (AoT) benefit from this, or does it make sense to use it for XLA (AoT)?
  • Is this effort somehow beneficial to the XNNPACK TFLite delegate?
  • Do you have any intention to support Arm microcontrollers, especially the Cortex-M family?

I would appreciate your insights.

Thank you.

XLA is definitely something we’d like to evolve to use more end-to-end MLIR CodeGen. At the moment we isolate some of the MLIR code in here: https://github.com/tensorflow/mlir-hlo ; and I expect 2021 to see significant progress there!

I am not familiar with the challenges associated with Cortex-M; looking quickly, it seems they don’t support floating point and the MMU is optional, is that it?

Right.

Well, signedness does have ABI impact, so it also led to the signext/zeroext attributes on functions to handle this. That said, ABI needs a whole collection of attributes anyway (given the LLVM design, which is… suboptimal), so this wasn’t enough to tip the balance. This doesn’t negate your point, though!

The key driver of this design was that “add int x, y” had the same semantics and behavior as “add uint x, y”, and it was/is valuable to represent them as the same operation. In the LLVM 1.x timeframe, the lack of this property caused tons of trivial pessimizations, for example when no-op casts from “int to uint” would block a peephole optimization.
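A minimal LLVM IR illustration of that point, in today’s signless form (the pre-1.x signed/unsigned type distinction is only paraphrased in the comment):

```llvm
; One signless add covers both the signed and unsigned interpretation of the
; operands; no "int to uint" no-op casts are needed around it, so peephole
; optimizations can fire without looking through casts.
define i32 @sum(i32 %x, i32 %y) {
  %r = add i32 %x, %y
  ret i32 %r
}
```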

-Chris


So it depends on the flavor of Cortex-M, but I was referring to the ones that have an FPU.