[RFC] Supporting ARM's SVE in LLVM

Hi,

We've been working for the last two years on support for ARM's Scalable Vector Extension in LLVM, and we'd like to upstream our work. We've had to make several design decisions without community input, and would like to discuss the major changes we've made. To help with the discussions, I've attached a technical document (also in plain text below) to describe the semantics of our changes. We've made the entire patch available as a github repository at https://github.com/ARM-software/LLVM-SVE and our presentation slides and video from the DevMeeting should be available on the llvm website soon.

A few caveats:
* This is not intended for code review, just to serve as context for discussion when we post patches for individual pieces (the first of which should be posted within the next two weeks)
* It's based on a version of upstream llvm which is a few months old at this point
* This is a warts-and-all release of our development tree, with plenty of TODOs and unfinished experiments
* We haven't posted our clang changes yet

-Graham

Support for Scalable Vector Architectures in LLVM IR

# Background

*ARMv8-A Scalable Vector Extensions* is a vector ISA extension for AArch64,
featuring lane predication, scatter-gather, and speculative loads. It is
intended to scale with hardware such that the same binary running on a processor
with wider vector registers can take advantage of the increased compute power
without recompilation. This document describes changes made to LLVM that allow
its vectorizer to better target SVE.

As the vector length is no longer a statically known value, the way in which the
LLVM vectorizer generates code required modifications such that certain values
are now runtime evaluated expressions. The following section documents the
additional IR instructions needed to achieve this. During the design of these
extensions, we endeavoured to strike a balance between elegant and succinct
design and the pragmatic realities of generating efficient code for the SVE
architecture. We believe that the current design has a relatively small impact
on the IR and type system given the unusual characteristics of the target.

# Types

## Overview:

## IR Class Changes:

To represent a vector of unknown length a scaling property is added to the
`VectorType` class whose element count becomes an unknown multiple of a known
minimum element count. To be specific `unsigned NumElements` is replaced by
`class ElementCount` with its two members:

* `unsigned Min`: the minimum number of elements.
* `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (`Scalable=false`) the scale is assumed to be one and
thus `Min` becomes identical to the original `NumElements` property.

This mixture of static and runtime quantities allow us to reason about the
relationship between different scalable vector types without knowing their
exact length.

## IR Interface:

  static VectorType *get(Type *ELType, ElementCount EC);
  // Legacy interface whilst code is migrated.
  static VectorType *get(Type *ElType, unsigned NumEls, bool Scalable=false);

  ElementCount getElementCount();
  bool isScalable();

## IR Textual Form:

The textual IR for vectors becomes:

`<[n x] <m> x <type>>`

where `type` is the scalar type of each element, `m` corresponds to `Min` and
the optional string literal `n x ` signifies `Scalable` is *true* when present,
false otherwise. The textual IR for non-scalable vectors is thus unchanged.

Scalable vectors with the same `Min` value have the same number of elements,
and the same number of bytes if `Min * sizeof(type)` is the same:

`<n x 4 x i32>` and `<n x 4 x i8>` have the same number of elements.

`<n x 4 x i32>` and `<n x 8 x i16>` have the same number of bytes.

## SelectionDAG

New scalable vector MVTs are added, one for each existing vector type. Scalable
vector MVTs are modelled in the same way as the IR. Hence, `<n x 4 x i32>`
becomes `nxv4i32`.

## MVT Interface:

  static MVT getVectorVT(MVT VT, ElementCount EC);
  bool isScalableVector() const;
  static mvt_range integer_scalable_valuetypes();
  static mvt_range fp_scalable_valuetypes();

To minimise the effort required for common code to preserve the scalable flag
we extend the helper functions within MVT/EVT classes to cover more cases. For
example:

  /// Return a VT for a vector type whose attributes match ourselves
  /// but with an element type chosen by the caller.
  EVT changeVectorElementType(EVT EltVT)`

# Instructions

The majority of instructions work seamlessly when applied to scalable vector
types. However, on occasion assumptions are made that allow vectorization logic
to be reduced directly to constants completely bypassing the IR (e.g. when the
element count is known). These assumption are generally unsafe for scalable
vectors which forces a requirement to express more logic in IR. To help with
this the following instruction are added to the IR.

## *elementcount*

### Syntax:

`<result> = elementcount <n x m x ty> <v1> as <ty2>`

### Overview:

This instruction returns the actual number of elements in the input vector of
type `<n x m x ty>` as the scalar type `<ty2>`. This is primarily used to
increment induction variables and replaces many uses of `VF` within the loop
vectorizer.

### IRBuilder Interface:

  Value *CreateElementCount(Type *Ty, Value *V, const Twine &Name = "");

### Fixed-Width Behaviour:

A constant with the value of `Min` is created.

### SelectionDAG:

See [*ISD::ELEMENT_COUNT*](#isdelementcount).

## *seriesvector*

### Syntax:

`<result> = seriesvector <ty> <v1>, <v2> as <n x m x ty>`

### Overview:

This instruction returns a vector of type `<n x m x ty>` with elements forming
the arithmetic series:

`elt[z] = v1 + z * v2`

where `z` is the element index and `<ty>` a scalar type. This is primarily used
to represent a vector of induction values leading to:

* Predicate creation using vector compares for fully predicated loops (see also:
  [*propff*](#propff), [*test*](#test)).
* Creating offset vectors for gather/scatter via `getelementptr`.
* Creating masks for `shufflevector`.

For the following loop, a seriesvector instruction based on `i` is used to
create the data vector to store:

  unsigned a[LIMIT];

  for (unsigned i = 0; i < LIMIT; i++) {
    a[i] = i;
  }

*seriesvector* instructions can optionally have *NUW* and *NSW* flags attached
to them. The semantics of these flags are intended to match those of the usual
scalar instruction wrap flags, but apply element-wise to the result vector. If
any element addition of the current value and step results in a signed or
unsigned overflow with the corresponding flag set, then the result is a poison
value for the entire vector. However, this feature is not currently used.

### IRBuilder Interface:

  Value *CreateSeriesVector(VectorType::ElementCount EC, Value *Start,
                            Value* Step, const Twine &Name = "",
                            bool HasNUW = false, bool HasNSW = false);

### Fixed-Width Behaviour:

A constant vector is created with the same arithmetic series.

### SelectionDAG:

See [*ISD::SERIES_VECTOR*](#isdseriesvector).

## *test*

### Syntax:

`<result> = test <cond> <ty> <v1>`

### Overview:

This instruction returns a scalar boolean value based on the comparison of a
boolean or vector boolean operand. It allows us to reason about a value as a
whole rather than the per scalar comparison obtained from say *icmp*. The most
common use is testing the result of *icmp*/*fcmp*.

We use this for the controlling predicate of fully predicated loops. Loops with
a termination condition as follows:

  for (i = 0; i < LIMIT; ++i)

are handled by testing their control predicate for `first true` to signify
another iteration is required.

Support for scalar booleans is simply to provide symmetry so that all variants
of *icmp*/*fcmp* can be passed as input to *test*.

#### Supported Conditions:

* all false
* all true
* any false
* any true
* first false
* first true
* last false
* last true

### IRBuilder Interface:

  Value *CreateTest(TestInst::Predicate P, Value *V, const Twine &Name = "");

### Fixed-Width Behaviour

Same as scalable.

### SelectionDAG:

See [*ISD::TEST_VECTOR*](#isdtestvector).

## *propff*

### Syntax:

`<result> = propff <ty> <v1>, <v2>`

### Overview:

This instruction creates a partitioned boolean vector based on the position of
the first boolean false value in the concatenation of input boolean vectors `v1`
and `v2`. Given vectors of length `k`, an element `x[i]` in the result vector is
true if, and only if, each element `y[0]...y[i+k]` in the concatenated input
vector `y=v1:v2` is true.

The following examples show the results for a full `v1` and a `v2` which has
reached a termination condition, and a `v1` which previously reached a
termination condition with any `v2`.

\ ![Propff Examples](Propff_Behaviour.png)

A core use of this instruction is to protect against integer overflow that can
occur when maintaining the induction vector of a fully predicated loop. For SVE
the issue is further compounded with its support of non-power-of-2 vector
lengths.

We believe *propff* to be the least complex partitioning instruction to provide
sufficient abstraction yet achieve the expected SVE code quality. Although
other partitioning schemes can be modelled using *propff* we feel a more
[capable partitioning instruction](#partition) is worth highlighting because
it can simplify the vectorization of more loops.

### IRBuilder Interface:

  Value *CreatePropFF(Value* P1, Value *P2, const Twine &Name = "");

### Fixed-Width Behaviour:

Same as scalable.

### SelectionDAG:

See [*ISD::PROPAGATE\_FIRST\_ZERO*](#isdpropagatefirstzero).

## *shufflevector*

### Syntax:

`<result> = shufflevector <ty1> <v1>, <ty1> <v2>, <ty2> <mask>`

### Overview:

Not a new instruction but *shufflevector* is extended to accept non-constant
masks. For code that expects the original restriction
`ShuffleVectorInst::getMaskValue` has changed to guard against unsafe uses.

  // [old] Returns the mask value.
  static int getMaskValue(Constant *Mask, unsigned i);
  // [new] Returns true when Result contains the mask value, false otherwise.
  static bool getMaskValue(Value *Mask, unsigned i, int &Result);

Similar changes are made to related interfaces and their users.

### IRBuilder Interface:

  Value *CreateShuffleVector(Value *V1, Value *V2, Value *Mask,
                             const Twine &Name = "");

### Fixed-Width Behaviour:

No change.

### SelectionDAG:

See [*ISD::VECTOR\_SHUFFLE\_VAR*](#isdvectorshufflevar).

# Constants

Scalable vectors of known constants cannot be represented within LLVM due to
their unknown element count. Optimizations performed on scalable vectors become
much more reliant on `ConstantExpr`s. Most of the instructions talked about in
this document also have a matching `ConstantExpr` with the most common constant
folds duplicated to work with them.

This is most prevalent when considering vector splats. These are simulated via
a sequence of `insertelement`and `shufflevector` that are usually resolved to a
`Constant`. For scalable vectors this cannot be done and the original sequence
is maintained. To mitigate this, `PatternMatch` is extended to better support
`ConstantExpr`s along with new helpers like:

  #define m_SplatVector(X) \
    m_ShuffleVector( \
      m_InsertElement(m_Undef(), X, m_Zero()), \
      m_Value(), \
      m_Zero()) \

Zero is the exception with `zeroinitializer` applying equally well to scalable
and non-scalable vectors.

# Predicated floating point arithmetic {#predfparith}

When a loop is fully predicated it becomes necessary to have masked versions of
faulting instructions. Today we use the existing masked memory intrinsics to
ensure we do not access memory that would not have been accessed by the
original scalar loop.

The same principle applies to floating point instructions whereby we should not
trigger an exception that would not have been produced by the original scalar
loop.

We don't handle this today but are seeking advice as to the preferred method.
Options include:

1. Add masked versions of the affected instructions.
2. Extend existing instructions to take a predicate.
3. Construct IR that contains "safe" values for undefined locations.

Given the choice, Option 1 is our preferred route as it's in line with the
introduction of the masked memory intrinsics. Although, given the greater
potential for optimization, instruction forms may be preferable to intrinsics.

See also [*Predicated floating point intrinsics*](#predfpint)

# Intrinsics

## *llvm.masked.spec.load*

### Syntax:

`<result> = call {<data>, <pred>} @llvm.masked.spec.load.<type>(<ptr>, <align>, <mask>, <merge>)`

### Overview:

This intrinsic represents a load whereby only the first active lane is expected
to succeed with remaining lanes loaded speculatively. Along with the data this
intrinsic returns a predicate indicating which lanes of data are valid.

### Interface:

  CallInst *CreateMaskedSpecLoad(Value *Ptr, unsigned Align, Value *Mask,
                                 Value *PassThru = 0, const Twine &Name = "");

### Fixed-Width Behaviour:

Same as scalable.

## *llvm.ctvpop*

### Syntax:

### Overview:

This intrinsic returns a count of the set bits across a complete vector. It's
most useful when operating on predicates as it allows a portable way to create
predicate vectors that are partitioned differently (e.g. including the lane
with the first change, rather than *propff*'s exclusive behaviour).

### Interface:

  CallInst *CreateCntVPop(Value *Vec, const Twine &Name);

### Fixed-Width Behaviour

Same as scalable. However, it can also be modelled using `*llvm.ctpop*`
followed by a scalarized horizontal reduction. This is not possible with
scalable vectors because scalarization is not an option.

## *horizontal reductions*

Loops containing reductions are first vectorized as if no such reduction
exists, thus building a vector of accumulated results. When exiting the vector
loop a final in-vector reduction is done by effectively scalarizing the
operation across each of the result vector's elements.

For scalable vectors, whose element count is unknown, such scalarization is not
possible. For this reason we introduce target specific intrinsics to support
common reductions (e.g. horizontal add), along with TargetTransformInfo
extensions to query their availability. If a scalable vector loop requires a
reduction that's not provided by the target its vectorization is prevented.

When fully predicated vectorization is required, additional work is done within
the vector loop to ensure inactive lanes don't affect the accumulated result.
See also [Predicated floating point arithmetic](#predfparith).

## *Predicated floating point intrinsics* {#predfpint}

Libcalls not directly supported by the code generator must be serialized (See
[here](#pass_libcall)). Also, for fully predicated loops the scalar calls must
only occur for active lanes so we introduce predicated versions of the most
common routines (e.g. `*llvm.pow`). This is mostly transparent to the loop
vectorizer beyond the requirement to pass an extra parameter.

### TTI Interface:

  bool canReduceInVector(const RecurrenceDescriptor &Desc, bool NoNaN) const;
  Value* getReductionIntrinsic(IRBuilder<> &Builder,
                               const RecurrenceDescriptor& Desc, bool NoNaN,
                               Value* Src) const;

# SelectionDAG Nodes

## *ISD::BUILD_VECTOR* {#isdbuildvector}

`BUILD_VECTOR(ELT0, ELT1, ELT2, ELT3,...)`

This node takes as input a set of scalars to concatenate into a single vector.
By its very nature this node is incompatible with scalable vectors because
we don't know how many scalars it will takes to fill one. All common code
using this node has either been rewritten or bypassed.

## *ISD::ELEMENT_COUNT* {#isdelementcount}

`ELEMENT_COUNT(TYPE)`

See [*elementcount*](#elementcount).

## *ISD::EXTRACT_SUBVECTOR* {#isdextractsubvector}

`EXTRACT_SUBVECTOR(VECTOR, IDX)`

Usage of this node is typically linked to legalization where it's used to split
vectors. The index parameter is often absolute and proportional to the input
vector's element count.

For scalable vectors an absolute index makes little sense. We have changed this
parameter's meaning to become implicitly multiplied by `n` to match its main
usage.

The change has no effect when applied to non-scalable vectors, because `n ==
1`. No target specific code is affected and in many cases common code becomes
compatible with scalable vectors. For example:

`nxv2i64 extract_subvector(nxv4i64, 2)`

The real first lane becomes `n * 2`, resulting in the extraction of the top
half of the input vector. This maintains the intension of the original code for
both scalable and non-scalable vectors.

For common code that truly requires an absolute index we recommend a new
distinct ISD node to better differentiate such patterns.

## *ISD::INSERT_SUBVECTOR* {#isdinsertsubvector}

`INSERT_SUBVECTOR(VECTOR1, VECTOR2, IDX)`

We have made the equivalent change to this node's index parameter to match the
behaviour of [ISD::EXTRACT_SUBVECTOR](#ISD::EXTRACT_SUBVECTOR).

## *ISD::PROPAGATE\_FIRST\_ZERO* {#isdpropagatefirstzero}

`PROPAGATE_FIRST_ZERO(VEC1, VEC2)`

See [*propff*](#propff).

## *ISD::SERIES_VECTOR* {#isdseriesvector}

`SERIES_VECTOR(INITIAL, STEP)`

See [*seriesvector*](#seriesvector).

## *ISD::SPLAT_VECTOR* {#isdsplatvector}

`SPLAT_VECTOR(VAL)`

Within the code generator a splat is represented by ISD::BUILD_VECTOR and is
thus incompatible with scalable vectors.

We initially made use of [ISD::SERIES_VECTOR](#seriesvector) with a zero stride
but that brings with it a requirement for [ISD::SERIES_VECTOR](#seriesvector)
to support floating-point types. For this reason we created an explicit node
that also allowed less complex looking legalization/selection code.

## *ISD::TEST_VECTOR* {#isdtestvector}

`TEST_VECTOR(VEC, PRED)`

`VEC` is the boolean vector being tested, and `PRED` is a `TestCode` enum under
the `llvm::ISD` namespace which contains all the supported conditions.

See [*test*](#test).

## *ISD::VECTOR\_SHUFFLE\_VAR* {#isdvectorshufflevar}

`VECTOR_SHUFFLE_VAR(VEC1, VEC2, VEC3)`

See [*shufflevector*](#shufflevector).

# AArch64 Specific Changes

## Instruction Selection

In order to allow proper instruction selection there must be a direct mapping
from MVTs to SVE registers. SVE data registers have a length in multiples of
128bits (with 128bits being the minimum) and predicate registers have a bit for
every byte of a data register.

Given the 128bit minimum we map scalable vector MVTs whose static component is
also 128bits (e.g. MVT::nxv4i32) directly to SVE data registers. Scalable
vector MVTs with an `i1` element type and a static element count of 16
(128/8 = 16) or fewer (e.g. MVT::nxv4i1) are mapped to SVE predicate registers.

All other integer MVTs are considered illegal and are either split or promoted
accordingly. A similar rule applies to vector floating point MVTs but those
types whose static component is less that 128bits (MVT::nx2f32) are also mapped
directly to SVE data registers but in a form whereby elements are effectively
interleaved with enough undefined elements to fulfil the 128bit requirement.

We do this so that predicate bits correctly align to their data counterpart.
For example, for all vector MVTs with two elements, a predicate of nxv2i1 is
used, regardless of the data vector's element type.

## Stack Regions for SVE Objects

As SVE registers have an unknown size their associated fill/spill instructions
require an offset that will be implicitly scaled by the vector length instead
of bytes or element size. To accommodate this we introduce the concept of
`Stack Regions` that are areas on the stack associated with specific data types
or register classes.

Each `Stack Region` controls its own allocation, deallocation and internal
offset calculations. For SVE we create separate `Stack Region`s for its data
and predicate registers. Objects not belonging to a `Stack Region` use a
default so that existing targets are not affected.

## SVE-Specific Optimizations

The following SVE specific IR transformation passes are added to better guide
code generation. In some instances this has affected how loops are vectorized
because it's assumed they will occur. They also point to work other target's
may wish to consider.

### SVE Post-Vectorization Optimization Pass

This pass is responsible for converting generically vectorized loops into a
form that more closely resembles SVE's style of vectorization. For example,
rewriting a vectorized loop's control flow to utilize SVE's `while` instruction.

NOTE: This pass is very sensitive to the wrap flags (i.e. NSW/NUW). Much effort
has gone into ensuring they are preserved during vectorization to the extent of
introducing `llvm.assume` calls when beneficial.

### SVE Expand Libcall Pass{#pass_libcall}

In order to keep the loop vectorizer generic we maintain the ability to
vectorize libcalls we know the code generator cannot handle because scalable
vectors cannot be scalarized. This pass serializes such calls by introducing
loops using SVE's predicate iteration instructions.

Note that although such serialization can be achieved generically using
`extractelement`/`insertelement`, our experiments showed no route to efficient
code generation.

### SVE Addressing Modes Pass

This pass alleviates the negative effects of `Loop Strength Reduction` on
address computations for SVE targeted loops, leading to better code quality.

# Full Example

The following section shows how a C loop with a sum reduction is represented in
scalable IR (assuming -Ofast) and the final generated code.

  int SimpleReduction(int *a, int count) {
    int res = 0;
    for (int i = 0; i < count; ++i)
      res += a[i];

    return res;
  }

The IR representation shows each of the new instructions being used to control
the loop, along with one of the horizontal reduction intrinsics. The main
sequence (noted by the *;; Control* comments below) for controlling loop
iterations is:

1. Using *elementcount* to increment the induction variable
2. Using *seriesvector* starting from the current induction variable value to
    compare against a splat of the loop's trip count
3. Using *propff* to ensure the resulting predicate strictly partitions the
    predicate and does not wrap
4. Using *test* to determine whether the first lane is active (and thus at
    least one more iteration is required)

  define i32 @SimpleReduction(i32* nocapture readonly %a, i32 %count) #0 {
  entry:
    %cmp6 = icmp sgt i32 %count, 0
    br i1 %cmp6, label %min.iters.checked, label %for.cond.cleanup

  min.iters.checked:
    %0 = add i32 %count, -1
    %1 = zext i32 %0 to i64
    %2 = add nuw nsw i64 %1, 1
    %wide.end.idx.splatinsert = insertelement <n x 4 x i64> undef, i64 %2, i32 0
    %wide.end.idx.splat = shufflevector <n x 4 x i64> %wide.end.idx.splatinsert,
                              <n x 4 x i64> undef, <n x 4 x i32> zeroinitializer
    %3 = icmp ugt <n x 4 x i64> %wide.end.idx.splat, seriesvector (i64 0, i64 1)
    %predicate.entry = propff <n x 4 x i1> shufflevector (<n x 4 x i1> insertelement
                              (<n x 4 x i1> undef, i1 true, i32 0), <n x 4 x i1> undef,
                              <n x 4 x i32> zeroinitializer), %3
    br label %vector.body

  vector.body:
                                                        %min.iters.checked
    %index = phi i64 [ 0, %min.iters.checked ], [ %index.next, %vector.body ]
    %predicate = phi <n x 4 x i1> [ %predicate.entry, %min.iters.checked ],
                                         [ %predicate.next, %vector.body ]
    %vec.phi = phi <n x 4 x i32> [ zeroinitializer, %min.iters.checked ],
                                                    [ %8, %vector.body ]
    %4 = icmp ult i64 %index, 4294967296
    call void @llvm.assume(i1 %4)
    %5 = getelementptr inbounds i32, i32* %a, i64 %index
    %6 = bitcast i32* %5 to <n x 4 x i32>*
    %wide.masked.load = call <n x 4 x i32> @llvm.masked.load.nxv4i32(<n x 4 x i32>* %6,
                        i32 4, <n x 4 x i1> %predicate, <n x 4 x i32> undef), !tbaa !1
    %7 = select <n x 4 x i1> %predicate, <n x 4 x i32> %wide.masked.load,
                                           <n x 4 x i32> zeroinitializer
    %8 = add nsw <n x 4 x i32> %vec.phi, %7
    %index.next = add nuw nsw i64 %index, elementcount (<n x 4 x i64> undef) ;; Control 1
    %9 = add nuw nsw i64 %index, elementcount (<n x 4 x i64> undef)
    %10 = seriesvector i64 %9, i64 1 as <n x 4 x i64>                        ;; Control 2
    %11 = icmp ult <n x 4 x i64> %10, %wide.end.idx.splat
    %predicate.next = propff <n x 4 x i1> %predicate, %11                    ;; Control 3
    %12 = test first true <n x 4 x i1> %predicate.next                       ;; Control 4
    br i1 %12, label %vector.body, label %middle.block, !llvm.loop !5

  middle.block:
    %.lcssa = phi <n x 4 x i32> [ %8, %vector.body ]
    %13 = call i64 @llvm.aarch64.sve.uaddv.nxv4i32(<n x 4 x i1> shufflevector
         (<n x 4 x i1> insertelement (<n x 4 x i1> undef, i1 true, i32 0),
         <n x 4 x i1> undef, <n x 4 x i32> zeroinitializer), <n x 4 x i32> %.lcssa)
    %14 = trunc i64 %13 to i32
    br label %for.cond.cleanup

  for.cond.cleanup:
    %res.0.lcssa = phi i32 [ 0, %entry ], [ %14, %middle.block ]
    ret i32 %res.0.lcssa
  }

The main vector body of the resulting code is one instruction longer than it
would be for NEON, but no scalar tail is required and performance will scale
with register length. The *seriesvector*, *shufflevector*(splat), *icmp*,
*propff*, *test* sequence has been recognized and transformed into the `whilelo`
instruction.

\newpage

  SimpleReduction:
  // BB#0:
            subs    w9, w1, #1
            b.lt    .LBB0_4
  // BB#1:
            add     x9, x9, #1
            mov     x8, xzr
            whilelo p0.s, xzr, x9
            mov     z0.s, #0
  .LBB0_2:
            ld1w    {z1.s}, p0/z, [x0, x8, lsl #2]
            incw    x8
            add     z0.s, p0/m, z0.s, z1.s
            whilelo p0.s, x8, x9
            b.mi    .LBB0_2
  // BB#3:
            ptrue   p0.s
            uaddv   d0, p0, z0.s
            fmov    w0, s0
            ret
  .LBB0_4:
            mov     w0, wzr
            ret

<!--
Commented out until after DevMeeting talk
# Clang Changes

## ACLE Types

(TODO)
-->

# Appendix

The following is an alternative to the [*propff*](#propff) instruction described
above. We haven't implemented this (or worked out all the specific details)
since we can synthesize the initial partitions required with extra
operations, although matching those and transforming to appropriate AArch64 SVE
instructions is fragile given other optimization passes. This instruction is
more complex, but would allow us to explicitly specify the exact partition
desired.

## *partition*

### Syntax:

`<result> = partition first <part_on> [inclusive] <ty> <p> [, propagating <pprop>]`

`<result> = partition next <part_on> [inclusive] <ty> <p> from <pcont>`

### Overview:

This instruction creates a partitioned boolean vector based on an input vector
`p`. There are two main modes to consider. The *first* mode is an extended
version of [*propff*](#propff) where `part_on` is a boolean which determines
whether the partition is made based on the first false value or the first true
value. Propagating from a previous vector is optional. The `inclusive` option
produces a partition which includes the first true or false value; the default
is an exclusive partition which only covers the elements before the first true
or false value.

In cases where there are multiple subsets of data in a vector which must be
processed independently, we need a way to take the original boolean vector and
an existing partition to create the next partition to work on. This is provided
by the *next* form of *partition*. The `pcont` argument is the existing
partitioned vector, generated either by a *partition first* or another
*partition next*. If there are no more partitions after the current one, a
boolean vector of all false will be returned.

### Fixed-Width Behaviour:

Same as scalable.

ScalableIR.pdf (377 KB)

We've been working for the last two years on support for ARM's Scalable Vector Extension in LLVM, and we'd like to upstream our work. We've had to make several design decisions without community input, and would like to discuss the major changes we've made. To help with the discussions, I've attached a technical document (also in plain text below) to describe the semantics of our changes. We've made the entire patch available as a github repository at https://github.com/ARM-software/LLVM-SVE and our presentation slides and video from the DevMeeting should be available on the llvm website soon.

Hi Graham,

I began looking at this now and before I even delve deep into
understanding the IR changes, I have a few high-level comments on the
whole exercise...

* This is not intended for code review, just to serve as context for discussion when we post patches for individual pieces (the first of which should be posted within the next two weeks)

This email is long and hard to read. I'm not surprised no one replied
yet. I think your PDF attached is a good start away from the
complexity, but we're not going to get far if we try to do things in
one step.

Also, it would be good if at least some documentation was made
publicly available (specs, ISA, examples, scheduling, pipelines, etc),
so we could cross-reference and understand the tests. Better still, if
some kind of emulator existed, in order to test and validate the
tests.

Based on your repository, the number of changes is so great, and the
changes so invasive, that we really should look back at what we need
to do, one step at a time, and only perform the refactoring changes
that are needed for each step.

* This is a warts-and-all release of our development tree, with plenty of TODOs and unfinished experiments
* We haven't posted our clang changes yet

I don't mind FIXMEs or TODOs, but I did see a lot of spurious name
changes, enum value moves (breaking old binaries) and a lot of new
high-level passes (LoopVectorisationAnalysis) which will need a long
review on their own before we even start thinking about SVE.

I recommend you guys separate the refactoring from the implementation
and try to upstream the initial and uncontroversial refactorings (name
changes, etc), as well as move out the current functionality into new
passes, so then you can extend for SVE as a refactoring, not
move-and-extend in the same pass.

We want to minimise the number of changes, so that we can revert
breakages more easily, and have a steady progress, rather than a
break-the-world situation.

Finally, *every* test change needs to be scrutinised and guaranteed to
make sense. We really dislike spurious test changes, unless we can
prove that the test was unstable to being with, in which case we
change it to a better test.

For all those issues, the smaller the patches and the less
controversies they have, the better. Rule of thumb, have no more than
one controversy per patch.

Hope this makes sense.

cheers,
--renato

PS: If someone in the team is not subscribed to the list, I highly
recommend you to.

PS2: If you do subscribe, I highly recommend you develop some serious
filters. :slight_smile:

Hi Renato,

Sorry for the delay in responding. We've been busy rethinking some of our changes after the feedback we've received thus far (particularly from the devmeeting). The incremental patches will use our revised design(which should be less invasive), and I'll be updating our document to match.

This email is long and hard to read. I'm not surprised no one replied
yet. I think your PDF attached is a good start away from the
complexity, but we're not going to get far if we try to do things in
one step.

Based on your repository, the number of changes is so great, and the
changes so invasive, that we really should look back at what we need
to do, one step at a time, and only perform the refactoring changes
that are needed for each step.

We don't intend to do this all in one go; we fully expect that we'll need to refactor a few times based on community feedback as we incrementally add support for scalable vectors.

> * This is a warts-and-all release of our development tree, with plenty of TODOs and unfinished experiments
> * We haven't posted our clang changes yet
   
I don't mind FIXMEs or TODOs, but I did see a lot of spurious name
changes, enum value moves (breaking old binaries) and a lot of new
high-level passes (LoopVectorisationAnalysis) which will need a long
review on their own before we even start thinking about SVE.

I recommend you guys separate the refactoring from the implementation
and try to upstream the initial and uncontroversial refactorings (name
changes, etc), as well as move out the current functionality into new
passes, so then you can extend for SVE as a refactoring, not
move-and-extend in the same pass.

So our highest priority is getting basic support for SVE into the codebase (types, codegen, assembler/disassembler, simple vectorization); after that is in, we'll be happy to discuss our other changes like separating out loop vectorization legality, controlling loops via predication, or adding search loop vectorization.
    

We want to minimise the number of changes, so that we can revert
breakages more easily, and have a steady progress, rather than a
break-the-world situation.

Same for us. The individual patches will be relatively small, this repo was just for context if needed when discussing the smaller patches.
    

Finally, *every* test change needs to be scrutinised and guaranteed to
make sense. We really dislike spurious test changes, unless we can
prove that the test was unstable to being with, in which case we
change it to a better test.

Yep, makes sense.

Thanks,

-Graham

Hi,

Paul Walker has now uploaded the first set of IR support patches to phabricator, which use our revised design. We managed to remove the need for new instructions for basic scalable vectorization in favor of adding two new constant classes; here's a subset of the revised documentation describing just those constants:

## *vscale*

### Syntax:

`vscale`

### Overview:

This complex constant represents the runtime value of `n` for any scalable type
`<n x m x ty>`. This is primarily used to increment induction variables and
generate offsets.

### Interface:

  Constant *VScaleValue::get(Type *Ty);

### Example:

The following shows how an induction variable would be incremented for a
scalable vector of type `<n x 4 x i32>`.

  %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)

## *stepvector*

### Syntax:

`stepvector`

### Overview:

This complex constant represents the runtime value of a vector of increasing
integers in the arithmetic series:

`<0, 1, 2, ... num_elements-1>`

This is the basis for a scalable form of vector constants. Adding a splat
changes the effective starting point, and multiplying changes the step. The
main uses for this are:

* Predicate creation using vector compares for fully predicated loops (see also:
  [*propff*](#propff), [*test*](#test)).
* Creating offset vectors for gather/scatter via `getelementptr`.
* Creating masks for `shufflevector`.

For the following loop, a `stepvector` constant would be added to a splat of the
loop induction variable to create the data vector to store:

  unsigned a[LIMIT];

  for (unsigned i = 0; i < LIMIT; i++) {
    a[i] = i;
  }

### Interface:

  Constant *StepVectorValue::get(Type *Ty);

### Example:

The following shows the construction of a scalable vector of the form
<start, start-2, start-4, ...>:

  %elt = insertelement <n x 4 x i32> undef, i32 %start, i32 0
  %widestart = shufflevector <n x 4 x i32> %elt, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
  %step = insertelement <n x 4 x i32> undef, i32 -2, i32 0
  %widestep = shufflevector <n x 4 x i32> %step, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
  %stridevec = mul <n x 4 x i32> stepvector, %widestep
  %finalvec = add <n x 4 x i32> %widestart, %stridevec

Current patch set:
https://reviews.llvm.org/D27101
https://reviews.llvm.org/D27102
https://reviews.llvm.org/D27103
https://reviews.llvm.org/D27105

-Graham

    Hi Renato,
    
    Sorry for the delay in responding. We've been busy rethinking some of our changes after the feedback we've received thus far (particularly from the devmeeting). The incremental patches will use our revised design(which should be less invasive), and I'll be updating our document to match.
    
    > This email is long and hard to read. I'm not surprised no one replied
    > yet. I think your PDF attached is a good start away from the
    > complexity, but we're not going to get far if we try to do things in
    > one step.
    
    > Based on your repository, the number of changes is so great, and the
    > changes so invasive, that we really should look back at what we need
    > to do, one step at a time, and only perform the refactoring changes
    > that are needed for each step.
    
    We don't intend to do this all in one go; we fully expect that we'll need to refactor a few times based on community feedback as we incrementally add support for scalable vectors.
    
    > > * This is a warts-and-all release of our development tree, with plenty of TODOs and unfinished experiments
    > > * We haven't posted our clang changes yet
    >
    > I don't mind FIXMEs or TODOs, but I did see a lot of spurious name
    > changes, enum value moves (breaking old binaries) and a lot of new
    > high-level passes (LoopVectorisationAnalysis) which will need a long
    > review on their own before we even start thinking about SVE.
    >
    > I recommend you guys separate the refactoring from the implementation
    > and try to upstream the initial and uncontroversial refactorings (name
    > changes, etc), as well as move out the current functionality into new
    > passes, so then you can extend for SVE as a refactoring, not
    > move-and-extend in the same pass.
    
    So our highest priority is getting basic support for SVE into the codebase (types, codegen, assembler/disassembler, simple vectorization); after that is in, we'll be happy to discuss our other changes like separating out loop vectorization legality, controlling loops via predication, or adding search loop vectorization.
        
    > We want to minimise the number of changes, so that we can revert
    > breakages more easily, and have a steady progress, rather than a
    > break-the-world situation.
    
    Same for us. The individual patches will be relatively small, this repo was just for context if needed when discussing the smaller patches.
        
    > Finally, *every* test change needs to be scrutinised and guaranteed to
    > make sense. We really dislike spurious test changes, unless we can
    > prove that the test was unstable to being with, in which case we
    > change it to a better test.
    
    Yep, makes sense.
    
    Thanks,
    
    -Graham

Hi Graham,

One high level comment without reading the patchset too much - it seems ‘vscale’ in particular could be just as easy to implement as an intrinsic, which would be a less invasive patch.

Is there a reason you didn’t go down the intrinsic route?

James

Hi Graham,

I'll look into the patches next, but first some questions after
reading the available white papers on the net.

This complex constant represents the runtime value of `n` for any scalable type
`<n x m x ty>`. This is primarily used to increment induction variables and
generate offsets.

What do you mean by "complex constant"? Surely not Complex, but this
is not really a constant either.

From what I read around (and this is why releasing the spec is

important, because I'm basing my reviews on guess work), is that the
length of a vector is not constant, even on the same machine.

In theory, according to a post in the ARM forums (which now I forget),
the kernel could choose the vector length per process, meaning this is
not known even at link time.

But that's ok, because the SVE instructions completely (I'm guessing,
again) bypass the need for that "constant" to be constant at all, ie,
the use of `incw/incp`. Since you can fail half-way through, the width
that you need to increment to the induction variable is not even known
at run time! Meaning, that's not a constant at all!

Example: a[i] = b[ c[i] ];
  ld1w z0.s, p0/z, [ c, i, lsl 2 ]
  ld1w z1.s, p0/z, [ b, z0.s, stxw 2 ]

Now, z0.s load may have failed with seg fault somewhere, and it's up
to the FFR to tell brka/brkb how to deal with this.

Each iteration will have:
  * The same vector length *per process* for accessing c[]
  * A potentially *different* vector length, *per iteration*, for accessing b[]

So, while <n x m x i32> could be constant on some vectors, even at
compile time (if we have a flag that forces certain length), it could
be unknown *per iteration* at run time.

  %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)

Right, this would be translated to:
  incw x2

Now, the question is, why do we need "mul (i64 vscale, i64 4)" in the IR?

There is no semantic analysis you can do on a value that can change on
every iteration of the loop. You can't elide, hoist, combine or const
fold.

If I got it right (from random documents on the web), `incX` relates
to a number of "increment induction" functionality. `incw` is probably
"increment W", ie. 32-bits, while `incp` is "increment predicate", ie.
whatever the size of the predicate you use:

Examples:
  incw x2 # increments x2 to 4*(FFR successful lanes)
  incp x2, p0.b # increments x2 to 1*(FFR successful lanes)

So, this IR semantics is valid for the second case, but irrelevant for
the second. Also, I'm worried that we'll end up ignoring the
multiplier altogether, if we change the vector types (from byte to
word, for example), or make the process of doing so more complex.

The following shows the construction of a scalable vector of the form
<start, start-2, start-4, ...>:

  %elt = insertelement <n x 4 x i32> undef, i32 %start, i32 0
  %widestart = shufflevector <n x 4 x i32> %elt, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
  %step = insertelement <n x 4 x i32> undef, i32 -2, i32 0
  %widestep = shufflevector <n x 4 x i32> %step, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
  %stridevec = mul <n x 4 x i32> stepvector, %widestep
  %finalvec = add <n x 4 x i32> %widestart, %stridevec

This is really fragile and confusing, and I agree with James, an
intrinsic here would be *much* better.

Something like

%const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)

cheers,
--renato

Hi James,

With our goal of making scalable vectors a first class type we already want to change the IR to incorporate “vscale” as part of the printed type (i.e. it’s the “n” in “<n x 4 x i32>”). For this reason it makes sense to us if the isolated representation of “n” (i.e. vscale) is also treated as first class because the two representations go hand in hand.

The above is the stylistic answer but from a more practical point of view there will be many instances where “vscale” gets queried. Folds that currently exist for ConstantInt will need to have “vscale” variants. We concluded using an intrinsic would pollute the code base a lot more in the long run when compared to the new constant approached.

Paul!!!

Hi Paul,

Can you give us examples on where the vscale will be constant-folded?

cheers,
--renato

Hi Renato,

There's more comments inline below but I wanted to highlight a couple of things. Firstly we are not in a position to release the SVE specification at this time, we'll send a message when it's publicly available.

Related to this I want to push this and related conversations in a different direction. From the outset our approach to add SVE support to LLVM IR has been about solving the generic problem of vectorising for an unknown vector length and then extending this to support predication. With this in mind I would rather the problem and its solution be discussed at the IR's level of abstraction rather than getting into the guts of SVE.

I suggest this because trying to understand the nuances of mapping IR types to SVE registers, first faulting loads and predicate partition instructions without having access to the full specification is going to be painful, lead to confusion and is ultimately unnecessary.

Your example is potentially more complex than what we'll be working towards in the short to medium term. So I apologise if some of my responses seem a little dismissive but I'm keen to keep us on point whilst we work towards getting SVE support for the simple vectorisation cases upstream.

Paul!!!

Hi Graham,

I'll look into the patches next, but first some questions after
reading the available white papers on the net.

> This complex constant represents the runtime value of `n` for any scalable type
> `<n x m x ty>`. This is primarily used to increment induction variables and
> generate offsets.

What do you mean by "complex constant"? Surely not Complex, but this
is not really a constant either.

"complex constant" is the term used within the LangRef. Although its value can be different across certain interfaces this does not need to be modelled within the IR and thus for all intents and purposes we can safely consider it to be constant.

From what I read around (and this is why releasing the spec is
important, because I'm basing my reviews on guess work), is that the
length of a vector is not constant, even on the same machine.

In theory, according to a post in the ARM forums (which now I forget),
the kernel could choose the vector length per process, meaning this is
not known even at link time.

But that's ok, because the SVE instructions completely (I'm guessing,
again) bypass the need for that "constant" to be constant at all, ie,
the use of `incw/incp`. Since you can fail half-way through, the width
that you need to increment to the induction variable is not even known
at run time! Meaning, that's not a constant at all!

Example: a[i] = b[ c[i] ];
  ld1w z0.s, p0/z, [ c, i, lsl 2 ]
  ld1w z1.s, p0/z, [ b, z0.s, stxw 2 ]

This is not how speculation is handled within SVE. This is not the context to dig into this subject so perhaps we can start a separate thread. I ask this because speculation within the vectoriser is independent of scalable vectors.

Now, z0.s load may have failed with seg fault somewhere, and it's up
to the FFR to tell brka/brkb how to deal with this.

Each iteration will have:
  * The same vector length *per process* for accessing c[]
  * A potentially *different* vector length, *per iteration*, for accessing b[]

So, while <n x m x i32> could be constant on some vectors, even at
compile time (if we have a flag that forces certain length), it could
be unknown *per iteration* at run time.

I am not sure what point you are trying to make. I agree that when doing speculative loads the induction variable update is potentially different per iteration, being based on the result of the speculative load.

"vscale" is not trying to represent the result of such speculation. It's purely a constant runtime vector length multiplier. Such a value is required by LoopVectorize to update induction variables as describe below plus simple interactions like extracting the last element of a scalable vector.

On a related note don't directly link "<n x m x ty>" types to SVE registers. Although some will map directly we do adopt a similar approach as for non-scalable vectors in that within IR you can represent scalable vectors that are large/smaller than those directly supported by the target.

> ```llvm
> %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
> ```

Right, this would be translated to:
  incw x2

Now, the question is, why do we need "mul (i64 vscale, i64 4)" in the IR?

The answer is because that is how LoopVectorize updates its induction values.
For non-scalable vectors you would see:

    %index.next = add nuw nsw i64 %index, i64 4
    
for a VF of 4. Why wouldn't you want to see:

    %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)

for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x Ty>")

There is no semantic analysis you can do on a value that can change on
every iteration of the loop. You can't elide, hoist, combine or const
fold.

If I got it right (from random documents on the web), `incX` relates
to a number of "increment induction" functionality. `incw` is probably
"increment W", ie. 32-bits, while `incp` is "increment predicate", ie.
whatever the size of the predicate you use:

Examples:
  incw x2 # increments x2 to 4*(FFR successful lanes)
  incp x2, p0.b # increments x2 to 1*(FFR successful lanes)

So, this IR semantics is valid for the second case, but irrelevant for
the second. Also, I'm worried that we'll end up ignoring the
multiplier altogether, if we change the vector types (from byte to
word, for example), or make the process of doing so more complex.

As mentioned above I'd rather not describe the details of SVE instructions at this time because it'll only distract from the generic IR representation we are aiming for.

> The following shows the construction of a scalable vector of the form
> <start, start-2, start-4, ...>:
>
> ```llvm
> %elt = insertelement <n x 4 x i32> undef, i32 %start, i32 0
> %widestart = shufflevector <n x 4 x i32> %elt, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
> %step = insertelement <n x 4 x i32> undef, i32 -2, i32 0
> %widestep = shufflevector <n x 4 x i32> %step, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
> %stridevec = mul <n x 4 x i32> stepvector, %widestep
> %finalvec = add <n x 4 x i32> %widestart, %stridevec
> ```

This is really fragile and confusing, and I agree with James, an
intrinsic here would be *much* better.

Something like

%const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)

This intrinsic matches the seriesvector instruction we original proposed. However, on reflection we didn't like how it allowed multiple representations for the same constant. Instead we prefer "stepvector" to better allow a single canonical form for scalable vectors.

I know this doesn't preclude the use of an intrinsic, I just wanted to highlight that doing so doesn't automatically change the surrounding IR.

We wonder if this canonical form is worth explicitly using across all vector types to maintain a single code path (e.g. GEP related IR matching for strided access patterns) and to allow a prettier textual IR (e.g. a non-scalable 1024bit vector of bytes means a 128 entry constant vector to type).

Related to this I want to push this and related conversations in a different direction. From the outset our approach to add SVE support to LLVM IR has been about solving the generic problem of vectorising for an unknown vector length and then extending this to support predication. With this in mind I would rather the problem and its solution be discussed at the IR's level of abstraction rather than getting into the guts of SVE.

Hi Paul,

How scalable vectors operate is intimately related to how you
represent them in IR. It took a long time for the vector types to be
mapped to all available semantics. We still had to use a bunch of
intrinsics for scatter / gather, it took years to get the strided
access settled.

I understand that scalable vectors are orthogonal to all this, but as
a new concept, one that isn't available in any open source compiler I
know of, is one that will likely be very vague. Not publishing the
specs only make it worse.

I take the example of the ACLE and ARMv8.2 patches that ARM has been
pushing upstream. I have no idea what the new additions are, so I have
to take your word that they're correct. But later on, different
behaviour comes along for the same features with a comment "it didn't
work that way, let's try this". Sometimes, I don't even know what
failed, or why this new thing is better.

When that behaviour is constricted to the ARM back-end, it's ok. It's
a burden that me and Tim will have to carry, and so far, it has been a
small burden. But exposing the guts of the vectorizers (which are
already getting to a point where the need large refactorings), which
will affect all targets, need a bit more of concrete information.

The last thing we want is to keep changing how the vectorizer behaves
every six months without any concrete information as to why.

I also understand that LLVM is great at prototyping, and that's an
important step for companies like ARM to make sure their features work
as reliably as they expect in the wild, but I think adding new IR
semantics and completely refactoring core LLVM passes without a clue
is a few steps too far.

I'm not asking for a full spec. All I'm asking is for a description of
the intended basic functionality. Addressing modes, how to extract
information from unknown lanes, or if all reduction steps will be done
like `saddv`. Without that information, I cannot know what is the best
IR representation for scalable vectors or what will be the semantics
of shufffle / extract / insert operations.

"complex constant" is the term used within the LangRef. Although its value can be different across certain interfaces this does not need to be modelled within the IR and thus for all intents and purposes we can safely consider it to be constant.

From the LangRef:

"Complex constants are a (potentially recursive) combination of simple
constants and smaller complex constants."

There's nothing there saying it doesn't need to be modeled in IR.

"vscale" is not trying to represent the result of such speculation. It's purely a constant runtime vector length multiplier. Such a value is required by LoopVectorize to update induction variables as describe below plus simple interactions like extracting the last element of a scalable vector.

Right, I'm beginning to see what you mean...

The vectorizer needs that to be a constant at compile time to make
safety assurances.

For instance: for (1..N) { a[i+3] = a[i] + i; }

Has a max VF of 3. If the vectorizer is to act on that loop, it'll
have to change "vscale" to 3. If there are no loop dependencies, then
you leave as "vscale" but vectorizes anyway.

Other assurances are done for run time constants, for instance, tail
loops when changing

for (i=0; i<N; i++) -> for (i=0; i<N; i+=VF)

That VF is now a run-time "constant", and the vectorizer needs to see
it as much, otherwise it can't even test for validity.

So, the vectorizer will need to be taught two things:

1. "vscale" is a run time constant, and for the purpose of validity,
can be shrunk to any value down to two. If the value is shrunk, the
new compile time constant replaces vscale.

2. The cost model will *have* to treat "vscale" as an actual compile
time constant. This could come from a target feature, overriden by a
command line flag but there has to be a default, which I'd assume is
4, given that it's the lowest length.

    %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)

for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x Ty>")

I see what you mean.

Quick question: Since you're saying "vscale" is an unknown constant,
why not just:

   %index.next = add nuw nsw i64 %index, i64 vscale

All scalable operations will be tied up by the predication vector
anyway, and you already know what the vector type size is anyway.

The only worry is about providing redundant information that could go
stale and introduce bugs.

I'm assuming the vectorizer will *have* to learn about the compulsory
predication and build those vectors, or the back-end will have to
handle them, and it can get ugly.

%const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)

This intrinsic matches the seriesvector instruction we original proposed. However, on reflection we didn't like how it allowed multiple representations for the same constant.

Can you expand how this allows multiple representations for the same constant?

This is a series, with a start and a step, and will only be identical
to another which has the same start and step.

Just like C constants can "appear" different...

const int foo = 4;
const int bar = foo;
const int baz = 2 + 2;

I know this doesn't preclude the use of an intrinsic, I just wanted to highlight that doing so doesn't automatically change the surrounding IR.

I don't mind IR changes, I'm just trying to understand the need for it.

Normally, what we did in the past for some things was to add
intrinsics and then, if it's clear a native IR construct would be
better, we change it.

At least the intrinsic can be easily added without breaking
compatibility with anything, and since we're in prototyping phase
anyway, changing the IR would be the worst idea.

cheers,
--renato

Related to this I want to push this and related conversations in a different direction. From the outset our approach to add SVE support to LLVM IR has been about solving the generic problem of vectorising for an unknown vector length and then extending this to support predication. With this in mind I would rather the problem and its solution be discussed at the IR’s level of abstraction rather than getting into the guts of SVE.

Hi Paul,

How scalable vectors operate is intimately related to how you
represent them in IR. It took a long time for the vector types to be
mapped to all available semantics. We still had to use a bunch of
intrinsics for scatter / gather, it took years to get the strided
access settled.

I understand that scalable vectors are orthogonal to all this, but as
a new concept, one that isn’t available in any open source compiler I
know of, is one that will likely be very vague. Not publishing the
specs only make it worse.

I take the example of the ACLE and ARMv8.2 patches that ARM has been
pushing upstream. I have no idea what the new additions are, so I have
to take your word that they’re correct. But later on, different
behaviour comes along for the same features with a comment “it didn’t
work that way, let’s try this”. Sometimes, I don’t even know what
failed, or why this new thing is better.

When that behaviour is constricted to the ARM back-end, it’s ok. It’s
a burden that me and Tim will have to carry, and so far, it has been a
small burden. But exposing the guts of the vectorizers (which are
already getting to a point where the need large refactorings), which
will affect all targets, need a bit more of concrete information.

The last thing we want is to keep changing how the vectorizer behaves
every six months without any concrete information as to why.

I also understand that LLVM is great at prototyping, and that’s an
important step for companies like ARM to make sure their features work
as reliably as they expect in the wild, but I think adding new IR
semantics and completely refactoring core LLVM passes without a clue
is a few steps too far.

I’m not asking for a full spec. All I’m asking is for a description of
the intended basic functionality. Addressing modes, how to extract
information from unknown lanes, or if all reduction steps will be done
like saddv. Without that information, I cannot know what is the best
IR representation for scalable vectors or what will be the semantics
of shufffle / extract / insert operations.

“complex constant” is the term used within the LangRef. Although its value can be different across certain interfaces this does not need to be modelled within the IR and thus for all intents and purposes we can safely consider it to be constant.

From the LangRef:

“Complex constants are a (potentially recursive) combination of simple
constants and smaller complex constants.”

There’s nothing there saying it doesn’t need to be modeled in IR.

“vscale” is not trying to represent the result of such speculation. It’s purely a constant runtime vector length multiplier. Such a value is required by LoopVectorize to update induction variables as describe below plus simple interactions like extracting the last element of a scalable vector.

Right, I’m beginning to see what you mean…

The vectorizer needs that to be a constant at compile time to make
safety assurances.

For instance: for (1…N) { a[i+3] = a[i] + i; }

Has a max VF of 3. If the vectorizer is to act on that loop, it’ll
have to change “vscale” to 3. If there are no loop dependencies, then
you leave as “vscale” but vectorizes anyway.

Other assurances are done for run time constants, for instance, tail
loops when changing

for (i=0; i<N; i++) → for (i=0; i<N; i+=VF)

That VF is now a run-time “constant”, and the vectorizer needs to see
it as much, otherwise it can’t even test for validity.

So, the vectorizer will need to be taught two things:

  1. “vscale” is a run time constant, and for the purpose of validity,
    can be shrunk to any value down to two. If the value is shrunk, the
    new compile time constant replaces vscale.

  2. The cost model will have to treat “vscale” as an actual compile
    time constant. This could come from a target feature, overriden by a
    command line flag but there has to be a default, which I’d assume is
    4, given that it’s the lowest length.

%index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)

for a VF of “n*4” (remembering that vscale is the “n” in “<n x 4 x Ty>”)

I see what you mean.

Quick question: Since you’re saying “vscale” is an unknown constant,
why not just:

%index.next = add nuw nsw i64 %index, i64 vscale

All scalable operations will be tied up by the predication vector
anyway, and you already know what the vector type size is anyway.

The only worry is about providing redundant information that could go
stale and introduce bugs.

I’m assuming the vectorizer will have to learn about the compulsory
predication and build those vectors, or the back-end will have to
handle them, and it can get ugly.

%const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)

This intrinsic matches the seriesvector instruction we original proposed. However, on reflection we didn’t like how it allowed multiple representations for the same constant.

Can you expand how this allows multiple representations for the same constant?

This is a series, with a start and a step, and will only be identical
to another which has the same start and step.

Just like C constants can “appear” different…

const int foo = 4;
const int bar = foo;
const int baz = 2 + 2;

I know this doesn’t preclude the use of an intrinsic, I just wanted to highlight that doing so doesn’t automatically change the surrounding IR.

I don’t mind IR changes, I’m just trying to understand the need for it.

Normally, what we did in the past for some things was to add
intrinsics and then, if it’s clear a native IR construct would be
better, we change it.

At least the intrinsic can be easily added without breaking
compatibility with anything, and since we’re in prototyping phase
anyway, changing the IR would be the worst idea.

These last 3 paragraphs are a great summary of my position on this as well.

Thanks!

-eric

Thanks Renato, my takeaway is that I am presenting the design out of order. So let's focus purely on the vector length (VL) and ignore everything else. For SVE the vector length is unknown and can vary across an as yet undetermined boundary (process, library....). Within a boundary we propose making VL a constant with all instructions that operate on this constant locked within its boundary.

I know this stretches the meaning of constant and my reasoning (however unsound) is below. We expect changes to VL to be infrequent and not located where it would present an unnecessary barrier to optimisation. With this in mind the initial implementation of VL barriers would be an intrinsic that prevents any instruction movement across it.

Question: Is this type of intrinsic something LLVM supports today?

Why a constant? Well it doesn't change within the context it is being used. More crucially the LLVM implementation of constants gives us a property that's very important to SVE (perhaps this is where prototyping laziness has kicked in). Constants remain attached to the instructions that operate on them through until code generation. This allows the semantic meaning of these instruction to be maintained, something non-scalable vectors get for free with their "real" constants.

As a specific example take the vector reversal that LoopVectorize does when iterating backward through memory. For non-scalable vectors this looks thusly:

  shufflevector <4 x i32> %a, <4 x i32> undef, <i32 3, i32 2, i32 1, i32 0>

Throughout the IR and into code generation the intention of this instruction is clear. Now turning to scalable vectors the same operation becomes:

  shufflevector <n x 4 x i32> %a, <n x 4 x i32> undef, <n x 4 x i32> seriesvector ( sub (i32 VL, 1), i32 -1)

Firstly I'll highlight the use of seriesvector is purely for brevity, let's ignore that debate for now. Our concern is that not treating VL as a Constant means sub and seriesvector are no longer constant and are likely to be hoisted away from the shufflevector. The knock on effect being to force the code generator into generating generic vector permutes rather than utilise any specialised permute instructions the target provides.

Does this make sense? I am not after agreement just want to make sure we are on the same page regarding our aims before digging down into how VL actually looks and its interaction with the loop vectoriser’s chosen VF.

  Paul!!!

p.s.

I'll respond to the stepvector question later in a separate post to break down the different discussion points.

    > Related to this I want to push this and related conversations in a different direction. From the outset our approach to add SVE support to LLVM IR has been about solving the generic problem of vectorising for an unknown vector length and then extending this to support predication. With this in mind I would rather the problem and its solution be discussed at the IR's level of abstraction rather than getting into the guts of SVE.
    
    Hi Paul,
    
    How scalable vectors operate is intimately related to how you
    represent them in IR. It took a long time for the vector types to be
    mapped to all available semantics. We still had to use a bunch of
    intrinsics for scatter / gather, it took years to get the strided
    access settled.
    
    I understand that scalable vectors are orthogonal to all this, but as
    a new concept, one that isn't available in any open source compiler I
    know of, is one that will likely be very vague. Not publishing the
    specs only make it worse.
    
    I take the example of the ACLE and ARMv8.2 patches that ARM has been
    pushing upstream. I have no idea what the new additions are, so I have
    to take your word that they're correct. But later on, different
    behaviour comes along for the same features with a comment "it didn't
    work that way, let's try this". Sometimes, I don't even know what
    failed, or why this new thing is better.
    
    When that behaviour is constricted to the ARM back-end, it's ok. It's
    a burden that me and Tim will have to carry, and so far, it has been a
    small burden. But exposing the guts of the vectorizers (which are
    already getting to a point where the need large refactorings), which
    will affect all targets, need a bit more of concrete information.
    
    The last thing we want is to keep changing how the vectorizer behaves
    every six months without any concrete information as to why.
    
    I also understand that LLVM is great at prototyping, and that's an
    important step for companies like ARM to make sure their features work
    as reliably as they expect in the wild, but I think adding new IR
    semantics and completely refactoring core LLVM passes without a clue
    is a few steps too far.
    
    I'm not asking for a full spec. All I'm asking is for a description of
    the intended basic functionality. Addressing modes, how to extract
    information from unknown lanes, or if all reduction steps will be done
    like `saddv`. Without that information, I cannot know what is the best
    IR representation for scalable vectors or what will be the semantics
    of shufffle / extract / insert operations.
    
    > "complex constant" is the term used within the LangRef. Although its value can be different across certain interfaces this does not need to be modelled within the IR and thus for all intents and purposes we can safely consider it to be constant.
    
    From the LangRef:
    
    "Complex constants are a (potentially recursive) combination of simple
    constants and smaller complex constants."
    
    There's nothing there saying it doesn't need to be modeled in IR.
    
    > "vscale" is not trying to represent the result of such speculation. It's purely a constant runtime vector length multiplier. Such a value is required by LoopVectorize to update induction variables as describe below plus simple interactions like extracting the last element of a scalable vector.
    
    Right, I'm beginning to see what you mean...
    
    The vectorizer needs that to be a constant at compile time to make
    safety assurances.
    
    For instance: for (1..N) { a[i+3] = a[i] + i; }
    
    Has a max VF of 3. If the vectorizer is to act on that loop, it'll
    have to change "vscale" to 3. If there are no loop dependencies, then
    you leave as "vscale" but vectorizes anyway.
    
    Other assurances are done for run time constants, for instance, tail
    loops when changing
    
    for (i=0; i<N; i++) -> for (i=0; i<N; i+=VF)
    
    That VF is now a run-time "constant", and the vectorizer needs to see
    it as much, otherwise it can't even test for validity.
    
    So, the vectorizer will need to be taught two things:
    
    1. "vscale" is a run time constant, and for the purpose of validity,
    can be shrunk to any value down to two. If the value is shrunk, the
    new compile time constant replaces vscale.
    
    2. The cost model will *have* to treat "vscale" as an actual compile
    time constant. This could come from a target feature, overriden by a
    command line flag but there has to be a default, which I'd assume is
    4, given that it's the lowest length.
    
    > %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
    >
    > for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x Ty>")
    
    I see what you mean.
    
    Quick question: Since you're saying "vscale" is an unknown constant,
    why not just:
    
       %index.next = add nuw nsw i64 %index, i64 vscale
    
    All scalable operations will be tied up by the predication vector
    anyway, and you already know what the vector type size is anyway.
    
    The only worry is about providing redundant information that could go
    stale and introduce bugs.
    
    I'm assuming the vectorizer will *have* to learn about the compulsory
    predication and build those vectors, or the back-end will have to
    handle them, and it can get ugly.
    
    >> %const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)
    >
    > This intrinsic matches the seriesvector instruction we original proposed. However, on reflection we didn't like how it allowed multiple representations for the same constant.
    
    Can you expand how this allows multiple representations for the same constant?
    
    This is a series, with a start and a step, and will only be identical
    to another which has the same start and step.
    
    Just like C constants can "appear" different...
    
    const int foo = 4;
    const int bar = foo;
    const int baz = 2 + 2;
    
    > I know this doesn't preclude the use of an intrinsic, I just wanted to highlight that doing so doesn't automatically change the surrounding IR.
    
    I don't mind IR changes, I'm just trying to understand the need for it.
    
    Normally, what we did in the past for some things was to add
    intrinsics and then, if it's clear a native IR construct would be
    better, we change it.
    
    At least the intrinsic can be easily added without breaking
    compatibility with anything, and since we're in prototyping phase
    anyway, changing the IR would be the worst idea.
    
    cheers,
    --renato

I'm not asking for a full spec. All I'm asking is for a description of
the intended basic functionality. Addressing modes, how to extract
information from unknown lanes, or if all reduction steps will be done
like `saddv`. Without that information, I cannot know what is the best
IR representation for scalable vectors or what will be the semantics
of shufffle / extract / insert operations.

If you want to know more, our dev meeting talk and slides will
hopefully be available soon. If there'll be a significant delay we can
publish the slides ourselves for you to look at, those should be
sufficient for you to understand enough of the details to form an
opinion. We also have a white paper on general SVE and vector-length
agnostic programming available here:
http://developer.arm.com/hpc/a-sneak-peek-into-sve-and-vla-programming

Thanks,
Amara

Hi Amara,

Thanks! I've already dissected that one. :slight_smile:

It's probably the easiest way into SVE, for now.

cheers,
--renato

I'm sorry.. may I interrupt for a minute and try to grok things for a
bit different angle..

While the VL can vary.. in practice wouldn't the cost of vectorization
and width be tied more to the hardware implementation than anything
else? The cost of vectorizing thread 1 vs 2 isn't likely to change?
(Am I drunk and mistaken?)

If the above holds true then the the length would be only variable
between different hardware implementations.. (at least this is how I
understand it)

This seems tightly coupled to hardware..

Mistaken. :slight_smile:

The scale of the vector can change between two processes on the same
machine and it's up to the kernel (I guess) to make sure they're
correct.

In theory, it could even change in the same process, for instance, as
a result of PGO or if some loops have less loop-carried dependencies
than others.

The three important premises are:

1. The vectorizer still has the duty to restrict the vector length to
whatever makes it cope with the loop dependencies. SVE *has* to be
able to cope with that by restricting the number of lanes "per
access".

2. The cost analysis will have to assume the smallest possible vector
size and "hope" that anything larger will only mean profit. This seems
straight-forward enough.

3. Hardware flags and target features must be able to override the
minimum size, maximum size, etc. and it's up to the users to make sure
that's meaningful in their hardware.

cheers,
--renato

Thanks Renato, my takeaway is that I am presenting the design out of order. So let's focus purely on the vector length (VL) and ignore everything else. For SVE the vector length is unknown and can vary across an as yet undetermined boundary (process, library....). Within a boundary we propose making VL a constant with all instructions that operate on this constant locked within its boundary.

This is in line with my current understanding of SVE. Check.

I know this stretches the meaning of constant and my reasoning (however unsound) is below. We expect changes to VL to be infrequent and not located where it would present an unnecessary barrier to optimisation. With this in mind the initial implementation of VL barriers would be an intrinsic that prevents any instruction movement across it.

Question: Is this type of intrinsic something LLVM supports today?

Function calls are natural barriers, but they should outline the
parameters that cannot cross, especially if they're local, to make
sure those don't cross it. In that sense, specially crafted intrinsics
can get you the same behaviour, but it will be ugly.

Also, we have special purpose barriers, ex. @llvm.arm|aarch64.dmb,
which could serve as template for scalable-specific barriers.

Why a constant? Well it doesn't change within the context it is being used. More crucially the LLVM implementation of constants gives us a property that's very important to SVE (perhaps this is where prototyping laziness has kicked in). Constants remain attached to the instructions that operate on them through until code generation. This allows the semantic meaning of these instruction to be maintained, something non-scalable vectors get for free with their "real" constants.

This makes sense. Not just because it behaves similarly, but because
the back-end *must* guarantee it will be a constant within its
boundaries and fail otherwise. That's up to the SVE code-generator to
add enough SVE-specific instructions to get that right.

        shufflevector <n x 4 x i32> %a, <n x 4 x i32> undef, <n x 4 x i32> seriesvector ( sub (i32 VL, 1), i32 -1)

Firstly I'll highlight the use of seriesvector is purely for brevity, let's ignore that debate for now. Our concern is that not treating VL as a Constant means sub and seriesvector are no longer constant and are likely to be hoisted away from the shufflevector. The knock on effect being to force the code generator into generating generic vector permutes rather than utilise any specialised permute instructions the target provides.

The concept looks ok.

IIGIR, your argument is that an intrinsic will not look "constant
enough" to the other IR passes, which can break the contantness
required to generate the correct "constant" vector.

I'm also assuming SVE has an instruction that relates to the syntax
above, which will reduce the setup process from N instructions to one
and will be scale-independent. Otherwise, that whole exercise is
meaningless.

Something like:
  mov x2, #i
  const z0.b, p0/z, x2, 2 # From (i) to (2*VF)
  const z1.b, p0/z, x2, -1 # From (i) to (i - VF) in reverse

The undefined behaviour that will come of such instructions need to be
understood in order to not break the IR.

For example, if x2 is an unsigned variable and you iterate through the
array but the array length is not a multiple of VF, the last range
will pass through zero and become negative at the end. Or, if x2 is a
16-bit variable that must wrap (or saturate) and the same tail issue
happens above.

Does this make sense? I am not after agreement just want to make sure we are on the same page regarding our aims before digging down into how VL actually looks and its interaction with the loop vectoriser’s chosen VF.

As much sense as is possible, I guess.

But without knowing the guarantees we're aiming for, it'll be hard to
know if any of those proposals will make proper sense.

One way to make your "seriesvector" concept show up *before* any spec
is out is to apply it to non-scalable vectors.

Today, we have the "zeroinitializer", which is very similar to what
you want. You can even completely omit the "vscale" if we get the
semantics right.

Hope that helps.

cheers,
--renato

I'll bite my tongue on negative comments, but it seems that for
anything other than trivial loops this is going to put the burden
entirely on the user. Are you telling me the *kernel* is really going
to be able to make these decisions on the fly, correctly?

Won't this block loop transformations?

Most of SVE is not yet public, but from what is public a lot of the underlying concepts appear very similar to the Hwacha architecture from UCB, which has had quite a few publications and might be an easier thing for people who don’t have ARM NDAs to read about.

David

No. Let's make one thing clear now: we don't expect the VL to be
changed on the fly, once the process is started it's fixed. Otherwise
things like stack frames with SVE objects will be invalid.

Responding to Renato's mail earlier:
1. The vectorizer will have to deal with loop carried dependences as
normal, if it doesn't have a guarantee about the VL then it has to
either avoid vectorizing some loops, or it can cap the effective
vectorization factor by restricting the loop predicate to a safe
value.
2. Yes the cost model is more complicated, but it's not necessarily
the case that we assume the smallest VL. We can cross that bridge when
we get to it though.

Amara