[RFC] Vector/SIMD ISA Context Abstraction

(please cc me i am subscribed digest)

i have an idea which i have been meaning to float for some time. as
context: i am the lead author of the Draft SVP64 Cray-like Vector
Extensions for the Power ISA, which is being designed for Hybrid CPU,
VPU and 3D GPU workloads.

SVP64 is similar to Broadcom VideoCore IV's "repeat" feature and to
x86 "REP" but with Vectorisation Context. unlike x86 REP which simply
repeats the following instruction, SVP64 *augments* the following
instruction to:

* change any one of the src and dest registers to scalar or vector
* add both src *and dest* predication in some cases
* override the element width of the src registers and, independently,
the dest registers (8/16/32/64-bit or FP16/BF16/FP32/FP64), and
* add several modes including saturation, fail-first, iteration and
reduction - some never seen in any commercial ISA.

there are also two modes of operation:

* Vertical First which requires explicit incrementing of the Vector
Element offset (effectively turning the register file into an
indexable SRAM)
* Horizontal First which is equivalent to the original Cray Vectors and to RVV.

Vertical-First may be permitted to execute an arbitrary number of
elements in parallel "batches": interestingly, when those batches are
chosen at runtime to be equal to Maximum Vector Length, that
effectively executes *all* element operations Horizontally and
incidentally is directly equivalent to Cray-style Vector execution.
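
to make the two modes concrete, here is a rough python sketch (entirely
my own illustration, not SVP64 pseudocode; the register file is modelled
as a flat list of elements):

    # toy model: the same element-wise add, in both Vector modes

    def horizontal_first(regfile, dest, src, VL):
        # Cray-style: one instruction completes all VL elements
        for i in range(VL):
            regfile[dest + i] = regfile[src + i] + 1

    def vertical_first(regfile, dest, src, VL, batch=1):
        # scalar-style: each pass covers a "batch" of element
        # offsets, then the offset is *explicitly* incremented
        offset = 0
        while offset < VL:
            for i in range(offset, min(offset + batch, VL)):
                regfile[dest + i] = regfile[src + i] + 1
            offset += batch  # the explicit element-offset increment

    # with batch == Maximum Vector Length, Vertical-First collapses
    # into a single horizontal pass, i.e. Cray-style execution:
    rf1 = list(range(16)); rf2 = list(range(16))
    horizontal_first(rf1, 8, 0, 4)
    vertical_first(rf2, 8, 0, 4, batch=4)
    assert rf1 == rf2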

here's the problem:

where the Scalar Power ISA for the SFFS compliancy subset is 214
instructions, SVP64 Context is 24 bits and consequently multiplies
those 214 instructions to well north of a QUARTER OF A MILLION ISA
Intrinsics.

add in GPU-style Swizzle context and the Draft REMAP looping for
Matrix Multiply, FFT, DCT, Iterative reduction and other modes, and it
could well be several MILLION intrinsics.

the standard approach of autogenerating intrinsics with scripts,
making all of them available as a flat header file or c++ template -
which works extremely well for all other ISAs - is therefore
absolutely out of the question.

if however instead of an NxM problem this was turned into N+M,
separating out "scalar base" from "augmentation" throughout the IR,
the problem disappears entirely.
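
back-of-the-envelope, in python (the figure for M is my rough guess,
chosen merely to be consistent with the 1,000-8,000 estimate further
down this thread):

    N = 214     # scalar Power ISA SFFS base instructions
    M = 1200    # ballpark count of distinct SVP64 Contexts

    flat      = N * M   # one intrinsic per (base, context) pair
    separated = N + M   # base intrinsics plus context descriptors

    print(flat)       # 256800 - north of a quarter of a million
    print(separated)  # 1414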

the nice thing about that approach is that it also tidies up other
ISAs as well, including SIMD ones. very few ISAs have intrinsics
which are only inherently meaningful in a Vector context (a cross
product instruction would be a perfect illustrative exception to that
rule).

even permute / shuffle Vector/SIMD operations are separable into
"base" and "abstract Vector Concept": the "base" operation in that
case being "MV.X" (scalar register copy, indexable - reg[RT] =
reg[reg[RA]] - and its immediate variant reg[RT] = reg[RA+imm])
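
in python terms (purely illustrative, my own sketch), the scalar "base"
and the Vector Concept applied on top of it look like this:

    # scalar "base": indexable register copy, MV.X
    def mv_x(reg, RT, RA):
        reg[RT] = reg[reg[RA]]       # indexed variant

    def mv_x_imm(reg, RT, RA, imm):
        reg[RT] = reg[RA + imm]      # immediate variant

    # "abstract Vector Concept" applied on top: a permute/shuffle
    # is just the MV.X pattern repeated across VL elements
    def shuffle(reg, RT, RA, VL):
        # snapshot first, so overlapping src/dest acts as a permute
        gathered = [reg[reg[RA + i]] for i in range(VL)]
        for i in range(VL):
            reg[RT + i] = gathered[i]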

the issue is that this is a massive intrusive change, effectively a
low-level redesign of LLVM IR internals for every single back-end.

on the other hand, as we make progress over the next few years with
SVP64, if there were resistance to this concept, trying to shoe-horn
SVP64 into an NxM intrinsics concept is guaranteed to limit SVP64's
full capabilities or, worse, run people's machines out of resources
during compilation, and ultimately cause complaints about LLVM's
performance.

i have no idea where to go with this, and wanted to open up the floor
to alternatives as well as present an opportunity for discussion of the
ramifications, advantages and disadvantages of separating out
parallelism / vectorisation as an abstract concept from scalar "base"
intrinsics, and what that would look like in practice.

also, i have to ask: has anything like this ever been considered before?

l.

if however instead of an NxM problem this was turned into N+M,
separating out “scalar base” from “augmentation” throughout the IR,
the problem disappears entirely.

Hi Luke,

It’s not entirely clear to me what you are suggesting here.

For context:

  • Historically, we have tried to keep as many instructions as native IR as possible to avoid the explosion of intrinsics, as you describe.
  • However, traditionally, intrinsics reduce the number of instructions in a basic block instead of increasing them, so there’s always the balance.
  • For example, some reduction intrinsics were added to address bloat, but no target is forced to use them.
  • If you can represent the operation as a series of native IR instructions, by all means, you should do so.

I get it that a lot of intrinsics are repeated patterns over all variations and that most targets don’t have that many, so it’s “ok”.

I also get it that most SIMD vector operations aren’t intrinsically vector, but expansions of scalar operations for the benefit of vectorisation (plus predication, to avoid undefined behaviour and to allow “funny” patterns, etc).

But it’s not clear to me what the “augmentation” part would be in other targets.

even permute / shuffle Vector/SIMD operations are separable into
“base” and “abstract Vector Concept”: the “base” operation in that
case being “MV.X” (scalar register copy, indexable - reg[RT] =
reg[reg[RA]] - and its immediate variant reg[RT] = reg[RA+imm])

Shuffles are already represented as IR instructions (insert/extract vector), so I’m not sure this clarifies much.

Have you looked at the current scalable vector implementation?

It allows a set of operations on open-ended vectors that are controlled by a predicate, which is possibly the “augmentation” that you’re looking for?

the issue is that this is a massive intrusive change, effectively a
low-level redesign of LLVM IR internals for every single back-end.

Not necessarily.

For example, scalable vectors are being introduced in a way that non-scalable back-ends (mostly) won’t notice.
And it’s not just adding a few intrinsics, the very concept of vectors was changed.
There could be a (set of) construct(s) for your particular back-end that is invisible to others.

Of course, the more invisible things, the harder it is to validate and change intersections of code, so the change must really be worth the extra hassle.
With both Arm and RISCV implementing scalable extensions, that change was deemed worthy and work is progressing.
So, if you could leverage the existing code to your advantage, you’d avoid having to convince a huge community to implement a large breaking change.
And you’d also give us one more reason for the scalable extension to exist. :)

Hope this helps.

cheers,
–renato

(renato thank you for cc'ing, due to digest subscription at the moment)

if however instead of an NxM problem this was turned into N+M,
separating out "scalar base" from "augmentation" throughout the IR,
the problem disappears entirely.

Hi Luke,

It's not entirely clear to me what you are suggesting here.

it's a nebulous but fundamentally low-level concept that may take some
time to sink in, and for its significance to be fully appreciated.

some background: over the past 3+ years i have made a comprehensive
comparative study of ISAs (both SIMD and Vector), the latter only being
revived recently thanks to RVV bringing Cray-style Vectors back into the
forefront of computing research.

here is a quick summary of that:

* Packed SIMD. the worst kind of ISA. the following article puts
  it diplomatically:
  https://www.sigarch.org/simd-instructions-considered-harmful/
  i have no such compunction or affiliation, and as an engineer can speak
  freely and plainly. gloves off, statement of fact: Packed SIMD
  is the worst thing to become ubiquitous in computer science, bar none.
  frankly, the sooner Packed SIMD is shot and buried as an embarrassing
  and incredibly expensive footnote in the history of computing, the
  better [there are always exceptions: for e.g. embedded DSP Audio,
  Packed SIMD is perfect].

* Predicated SIMD: by total contrast, this is actually not half bad. SVE2,
  AVX-512, GPU ISAs, anything that takes masks per element, the masks
  completely eliminate Packed SIMD Hell (except for ISAs that still have
  SIMD alignment for memory LD/STs even for Predicated LD/STs,
  but hey, nothing's perfect)

* Horizontal-First Vectors. known best from the Cray-1, also in modern ISAs
  such as NEC's SX-Aurora and now RVV, horizontal vectors are of the
  form "for i in range(VL): operation(vec_src_reg[i], vec_dest_reg[i])"

* Vertical-First Vectors. these are *NOT* very well-known, and the only
  two ISAs that i know of (to date) are SVP64 and Mitch Alsup's MyISA 66000.

btw, here's the [terse] page on SVP64's format:
    https://libre-soc.org/openpower/sv/svp64/
the overview is much more useful for understanding:
    https://libre-soc.org/openpower/sv/overview/

additional relevant ISAs:

* Broadcom VideoCore IV which has a Vector-form of "REP", where the
  operation may be repeated 2, 4, 8, 16, 32 or "the count taken from
  scalar r0".

* the Mill Architecture, which is a "tagged" ISA. here, there is no "ADD32",
  "ADD16", "ADD64", "ADD8", there is *ONLY* "ADD". the width of the
  operation is taken *FROM THE LOAD OPERATION* which is the ONLY
  place where the register operand width is specified. operations are
  DEDUCED, statically, at compile-time, based on the "tags" associated
  with source registers as they cascade through. two strategically
  important instructions are included: WIDEN and NARROW.
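
a toy python model of that tagged approach (my own illustration, not
actual Mill code) - note how width appears *only* at the load:

    class Tagged:
        def __init__(self, value, width):
            self.value, self.width = value, width

    def load(mem, addr, width):     # the ONLY place width is stated
        return Tagged(mem[addr] & ((1 << width) - 1), width)

    def add(a, b):                  # width DEDUCED from the tags
        assert a.width == b.width
        return Tagged((a.value + b.value) & ((1 << a.width) - 1),
                      a.width)

    def widen(a):                   # the strategic tag-changing ops
        return Tagged(a.value, a.width * 2)

    def narrow(a):
        half = a.width // 2
        return Tagged(a.value & ((1 << half) - 1), half)

    m = {0: 0x1FF}
    a = load(m, 0, 8)               # tag says 8-bit: value 0xFF
    print(add(a, a).value)          # 0xFE - wraps within the 8-bit tag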

the significance of mentioning the Mill is that the ISA has a closer
match to the simple (basic) LLVM intrinsics than most other ISAs.
however even there (unless they've done drastic and extensive
changes) they will be limited to trying to fit a flexible (tagged) ISA
into an inflexible IR that was never designed with "context" in mind.

For context:
* Historically, we have tried to keep as many instructions as native IR as possible to avoid the explosion of intrinsics, as you describe.

a crucially important goal that gets a big thumbs-up from me.

* However, traditionally, intrinsics reduce the number of instructions in a basic block instead of increasing them, so there's always the balance.

the flip-side being that the CISC-ness of a given new intrinsic could
itself impact ISAs that do not support that feature natively, forcing
them to emit rather more assembly instructions than at first appears.

* For example, some reduction intrinsics were added to address bloat, but no target is forced to use them.

excellent. iteration and reduction (including fixed schedule
paralleliseable reduction) is one of the intrinsics being added to
SVP64. it's good to hear that that, as a concept, has been added. if
i may, i will use that as an example, later.

* If you can represent the operation as a series of native IR instructions, by all means, you should do so.

this assumes (perfectly reasonably, mind you) that the (hypothetical)
ISA itself is not capable of expressing a given operation *in* IR, and
consequently has to be done as a series of passes, substituting for a
lack of native (direct) support of a given operation with some
(faster?) operations that *do* (ultimately) exist as actual assembler.

in some architectures a particular native IR instruction might
actually exist, but the native assembler variant is so horribly slow
at the hardware level that alternatives are actually *demanded* by
users. AVX's native Horizontal Reduction instructions would be a good
example.

I get it that a lot of intrinsics are repeated patterns over all variations and that most targets don't have that many, so it's "ok".

I also get it that most SIMD vector operations aren't intrinsically vector,

[indeed. i have spent considerable time recently on the wikipedia
Vector_processor page, and associated nearby related pages
(SIMD, GPUs, etc), correcting that unfortunate meme that "SIMD equals
vectors". this is unfortunately where Corporate Marketing has badly
interfered with actual Computer Science. sigh.]

but expansions of scalar operations for the benefit of vectorisation
(plus predication, to avoid undefined behaviour and to allow "funny" patterns, etc).

yes. and this perspective is where Mitch Alsup's MyISA 66000, and SVP64's
"Vertical-First" Mode come into play: the instructions in both are effectively
executed *in scalar form ONLY* (as far as the Program Order is concerned),
and, at the end of a loop/branch, you *EXPLICITLY* increment the element
index, such that all *SCALAR* operations in the loop now execute on *scalar*
element one. end of loop, explicit increment element index to 2, loop back
and execute on element *two* of the Vector Register. repeat until the
loop-termination condition is met.

the challenge ahead for Libre-SOC (and for MyISA 66000) will be to introduce
this entirely new concept to compilers. however given that it's effectively
scalar rather than Vector, the task should actually be a *lot* easier than it
is for Horizontal-First ISAs such as SVE and RVV.

But it's not clear to me what the "augmentation" part would be in other targets.

the proposal is - at its heart - to replace all IR of the form:

    llvm.masked.load.v16f32.predicatespec(arguments)
    llvm.masked.load.v2f64.predicatespec(arguments)

and so on

with just:

     llvm.load(mask=x, arguments).

where there *is* no llvm.masked.load: there is *only* an optional mask
argument, interpreted when llvm itself runs (i.e. at compile-time of
the target program), so that rather than expanding out explicitly /
statically into dozens of special IR intrinsics, *there is only one*:

     llvm.load.

additional optional arguments also then specify whether this operation
is twin-predicated by having a *second* predicate mask (yes, SVP64
can apply one predicate mask to the source, and another to the destination.
conceptually this is equivalent to back-to-back VGATHER-VSCATTER).

additional optional arguments also then specify whether there is SWIZZLE
applied, or Sub-Vectors, or any other types of "augmentation".

now, here's the kicker:

what we need to support SVP64 is for *all llvm basic intrinsics to support
all possible optional augmentations of all possible types*.

yes, really, that's not a typo or a mis-statement. we *genuinely* need
a sign-extended twin-predicated intrinsic:

     llvm.sext(source_mask=source_pred, dest_mask=dest_pred, source_argument)
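
rough python semantics for that twin-predicated sext (my own sketch of
the concept: active source elements are gathered in order, the results
scattered to the active destination slots - back-to-back
VGATHER-VSCATTER, exactly as described above):

    def twin_predicated(op, src, dest, src_mask, dest_mask):
        gathered = (op(s) for s, m in zip(src, src_mask) if m)
        for i, m in enumerate(dest_mask):
            if m:
                try:
                    dest[i] = next(gathered)
                except StopIteration:
                    break            # ran out of active sources

    def sext8(x):                    # scalar "base": sign-extend a byte
        return x - 256 if x & 0x80 else x

    dest = [0] * 4
    twin_predicated(sext8, [0x7F, 0xFF, 0x80, 0x01], dest,
                    src_mask=[1, 0, 1, 1], dest_mask=[0, 1, 1, 1])
    print(dest)                      # [0, 127, -128, 1]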

even permute / shuffle Vector/SIMD operations are separable into
"base" and "abstract Vector Concept": the "base" operation in that
case being "MV.X" (scalar register copy, indexable - reg[RT] =
reg[reg[RA]] - and its immediate variant reg[RT] = reg[RA+imm])

Shuffles are already represented as IR instructions (insert/extract vector), so I'm not sure this clarifies much.

ok, so is it possible to do shuffle-sign-extend, shuffle-fptrunc,
shuffle-fabs, shuffle-sqrt, shuffle-log, and any other single-src
single-dest operation?

this is where the "augmentation" - the separation of PREFIX-SUFFIX
comes into play.

SVP64 has the ability to set up "SWIZZLE" contexts as well as certain
kinds of "REMAP" Schedules (triple-loop butterfly schedules) -
PREFIXes - that can be *applied* to base operations (SUFFIXes) which,
if we were to expand all those possibilities out, would literally
create several MILLION intrinsics.

Have you looked at the current scalable vector implementation?

briefly, yes. i also helped with some review insights when RVV was
being added, although that was a brief glimpse into a massive world
where i was (and still am) constrained, unfortunately, by time and
resources, much as i would love that to be otherwise.

It allows a set of operations on open-ended vectors that are controlled by a predicate, which is possibly the "augmentation" that you're looking for?

no. what is happening there is that it is a reflection of the
limitations of the current ISAs. i can say with 100% certainty that
the SVE implementation will not have been designed to take SVP64 into
consideration.

the reason is actually very simple and straightforward: at the time
LLVM SVE was added, SVP64 did not even exist.

so for example, let us take the new feature added in LLVM SVE: reduction.

most Vector ISAs add *explicit* reduction operations. NEC SX-Aurora
for example has reduce-add, reduce-multiply, reduce-OR, reduce-AND,
reduce-XOR, and that's about it.

SVP64 has:

   * reduction as a fixed (paralleliseable) schedule
   * base operation.

you can LITERALLY apply reduction to the "llvm.maximum" scalar
operation, or to divide or subtract (or any other non-commutative
operation) if you really really want to, and the ISA will go, "ok,
done. next?".

    sv.fmax/MR FRT.v, FRA.v, FRB.v # MR means "map-reduce mode"

you can apply parallel reduction to Power ISA v3.0 Condition Register
operations, "crand" or "cror" or "crnor".

    sv.crand/MR BT.v, BA.v, BC.v

you can even apply parallel reduction to single-argument instructions
if you really, really want to: we're not going to stop that from
happening, because somebody might find it useful given the fact that
the parallel-reduction is on a fixed Power-2-halving Schedule that
could have practical uses, and the hardware is *required* to write out
all intermediate values into a *vector* result.
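
a sketch of what such a fixed power-of-2-halving schedule might look
like in python (illustrative only; the real Schedule is defined in the
SVP64 spec). every inner iteration within a step is independent, hence
paralleliseable, and the intermediates stay visible in the vector:

    def tree_reduce(vec, op):
        step = 1
        while step < len(vec):
            for i in range(0, len(vec) - step, 2 * step):
                # pairs at distance `step` are independent:
                # hardware may execute them all in parallel
                vec[i] = op(vec[i], vec[i + step])
            step *= 2
        return vec[0]

    v = [3, 1, 4, 1, 5, 9, 2, 6]
    print(tree_reduce(v, max))   # 9; v retains the intermediate values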

you can even apply parallel reduction Schedules to triple-argument
instructions (FMA), however there it gets tricky and complicated (and
i haven't thought it through, fully, what it actually means, i.e.
whether it's useful). certainly if the MUL register argument is
considered scalar and the others Vector, that is actually useful
(performs repeated cumulative multiply as part of the Schedule).

does this help illustrate what i mean by "augmentation"? there is a
"base" (scalar) operation, you "augment" it, and it *becomes*
SIMD-like, *becomes* Vector-like, *becomes* predicated, *becomes*
Swizzled, *becomes* reduced.

the development of LLVM SVE would not have taken this possibility into
account, because, put simply, it is plain common sense in something as
complex as LLVM not to waste time writing code for something that does
not have a real-world use-case.

the issue is that this is a massive intrusive change, effectively a
low-level redesign of LLVM IR internals for every single back-end.

Not necessarily.

this would be fantastic (and a huge relief) if it can be so arranged. one of my
biggest concerns is that what i am advocating is at such a fundamental level
that it could, if done incorrectly, be extremely disruptive.

however even now as i think about it on-the-fly, if the proposal is as simple
as adding c++-like "optional named arguments" to (all) base scalar LLVM
intrinsics, then, i think that would work extremely well. it would have
zero impact on other ISAs, which is a huge plus.

For example, scalable vectors are being introduced in a way that non-scalable back-ends (mostly) won't notice.
And it's not just adding a few intrinsics, the very concept of vectors was changed.
There could be a (set of) construct(s) for your particular back-end that is invisible to others.

the problem is that all other Vector ISAs have constrained themselves to 32 bit
(or, for GPU ISAs, often 48 or 64). they *explicitly* add *explicit*
O(N) opcodes.
RVV adds 192 *explicit* opcodes, embedded into a MAJOR 32-bit opcode
specifically dedicated for use by RVV, and that was part of its original design.

ARM, likewise, will have done something similar, with SVE and SVE2.

the problem with that approach is that it is extremely limiting in the possible
permutations / combinations of *potential* instructions that *could* exist,
if there was not such a limit of trying to cram into a 32-bit space.

[it does have to be said, however, that there are some serious practical
benefits to limiting the possibilities of an ISA: validation and verification
of silicon before spending USD 16 million on 7nm masks is a Damn Good
Reason :) and it is one that we are going to have to give some serious
thought to: how to verify the hardware for an ISA with literally several
MILLION instructions]

we have left the entirety of the Scalar Power v3.0B ISA alone (which
is 32-bit), called that "base", and added a full 32-bit Prefix (called
SVP64) which contains the Vectorisation Context.

SVP64 is - fundamentally - an O(NxM) ISA, where N ~= 250
and M is ~= 1,000 to 8,000.

actually, it's O(NxMxOxPxQ) where:

* N~=250 is the base scalar Power v3.0B ISA
* M~=1,000-8,000 is the Vectorisation Context
* O~=2^20 (guessing here) is REMAP Schedules and
* P~=2^(3*12) is SWIZZLE Contexts for GPUs (XXYZ, WZZX)
* Q=64 (Vector Length, VL)

thus for example with Twin-Predication applied to e.g. llvm.sext or llvm.cos we
have implicit back-to-back VGATHER-VSCATTER behaviour *WITHOUT*
needing a pair of LD/ST operations or register MV operations before and
after the Vectorised operation.

we anticipate some extremely powerful compact representations, and to
be honest it may literally take several years for the full implications
of SVP64's power and flexibility to sink in.

in-register paralleliseable DCT can be done in 9 instructions, and a
paralleliseable in-register-file (small) insertion-sort likely in
around 11 instructions, thanks to the Data-Dependent
Fail-on-First-Condition Mode.

we can even implement a (small) Vectorised quick-sort, in-register, fully
paralleliseable, in probably about... 20 instructions. it's on my TODO list
to investigate.

Of course, the more invisible things, the harder it is to validate and change intersections of code, so the change must really be worth the extra hassle.

appreciated.

With both Arm and RISCV implementing scalable extensions, that change was deemed worthy and work is progressing.
So, if you could leverage the existing code to your advantage, you'd avoid having to convince a huge community to implement a large breaking change.

the possibility that occurred to me above, whilst writing this, of
adding optional arguments (containing the Vector Augmentation Context)
to base scalar llvm intrinsics would, i believe, achieve that.

if any other ISA vendors wanted to use that, they could, as a first pass, map
e.g. llvm.load(optional_predicate=xxx) *onto* llvm.masked.load(....) and thus
avoid huge disruption, and carry that out in an incremental fashion. or not.
at their choice.

there are several variants on this theme of optional arguments of some
description to the base:

    llvm.add<optional_vector_context>(normal_arguments)

where optional_vector_context is an object of some type that
itself contains optional "augmentation" features. i would advocate
something like this:

    llvm.add(normal_arguments,
             source_override_width=<8/16/32/64>,
             dest_override_width=<8/16/32/64>,
             saturation_mode=<signed/unsigned>,
             source_pred=xxx,
             dest_pred=yyyy,
             fail_first_mode,
             swizzle_src=<XYZW>,
             REMAP_schedules=<RA,RB,RC,RT,RS>,
             scalar_or_vector_regs=<RA=v/s, RB=v/s, RC=v/s, RT=v/s, RS=v/s>)

can you imagine expanding all of those out into a declared (flat) list
of intrinsics? what that would do to LLVM SVE if we tried?

the "augmentation" list is absolutely massive and starts to give some
idea of why LLVM SVE, as designed, simply won't cope, and why we have
to think about this differently.

the thing is: from the study i've made of other ISAs, i can say with
near 100% certainty that there *will* be a direct map to all of the
existing LLVM SVE intrinsics recently added *and to those of all SIMD
ISAs as well*

and, what i expect to happen is that instead of a massive list of thousands
of SIMD intrinsics for e.g. x86, it will reduce down to a fraction of what
is in LLVM x86 backend right now. in fact, i expect the exact same reduction
to occur for *all* Packed and Predicated SIMD ISAs supported by LLVM.

that will bring both a reduction in maintenance burden and, in theory,
a reduction in compile times as well. in theory. in practical terms it
depends what the impact of the "optional" arguments is.

hmmm, that will need some thought.

even the Mill, i believe, could benefit, by being able to map much more
closely to the actual underlying ISA - which *only* has "ADD", not
ADD8/16/32/64 - because they could potentially add an "auto" or
"implicit" option to the source width / dest width arguments, much more
in line with how the actual ISA itself works (implicit tagged -
polymorphic - registers)

https://millcomputing.com/docs/compiler/

And you'd also give us one more reason for the scalable extension to exist. :)

:)

as i mentioned at the start, with that list of ISAs, there _do_ exist
other Vector ISAs and actual hardware implementations out there: NEC's
SX series has been shipping for decades now - first implementations
date back to April 1983! - with SX-Aurora being the latest.
   https://en.wikipedia.org/wiki/NEC_SX

here's some background:

    https://sx-aurora.github.io/

and yes, they do have a Vector Extension variant of llvm:

    https://sx-aurora.github.io/posts/llvm-ve-rv/

i do hope at some point that they come out of the woodwork and participate
in LLVM SVE. and that the product continues to ship, it's pretty incredible
and i am delighted that NEC has had a strong enough customer base to
keep on selling it and maintaining SX-Aurora.

Mitch Alsup's MyISA 66000 will need gcc and llvm at some point, and it is
another ISA with a form of Scalable Vectors - one that has been specially
designed to "thunk" down to simple scalar hardware.

thank you, Renato, for responding, it's given me the opportunity to explain
a bit more in-depth. feel free to cc libre-soc-dev in future, we don't mind
[relevant!] cross-posts.

warmest,

l.

apologies to all for the follow-up, i realised i joined iteration and reduction together as if they were the same concept: they are not.

an Iterative Sum, when carried out as an add on a Vector containing all 1s, results in a Pascal Triangle Vector output

example of existing hardware that has actual Iteration instructions: Section 8.15 of SX-Aurora ISA guide, p8-297, the pseudocode for Iterative Add:

for (i = 0 to VL-1) {
    Vx(i) ← Vy(i) + Vx(i-1), where Vx(-1)=Sy
}

where if Vx and Vy are the same register you get the Pascal Triangle effect.

https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf
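
a python equivalent of that pseudocode (mine, for illustration),
showing the Pascal Triangle effect when Vx and Vy are the same
register:

    def iterative_add(Vx, Vy, Sy):
        prev = Sy
        for i in range(len(Vx)):   # strictly sequential: i needs i-1
            Vx[i] = Vy[i] + prev
            prev = Vx[i]

    v = [1, 1, 1, 1, 1]
    for _ in range(3):
        iterative_add(v, v, 0)     # Vx and Vy: same "register"
        print(v)
    # [1, 2, 3, 4, 5]
    # [1, 3, 6, 10, 15]
    # [1, 4, 10, 20, 35]  <- diagonals of Pascal's Triangle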

SVP64 does not have this *specifically* added: it is achieved incidentally by issuing an add where the src and dest registers differ by one (SVP64 sits on top of a rather large scalar regfile: 128 64-bit entries)

   sv.add r1, r1, r0

we did however need to add a "reverse gear" (for (i = VL-1 downto 0)) which was needed, ironically, for ffmpeg's MP3 CODEC to *avoid* the Pascal Triangle effect (and to not need to copy a large batch of registers instead)

can anyone say if LLVM SVE happened to add Iteration?

l.

again, apologies, a follow-up: i'd like to keep the conversation going (with everyone).

a reminder / summary of the proposal:

     all basic *scalar* LLVM intrinsics extend with *optional* arguments that
     provide Vector / SIMD Augmentation Context.

the benefit being that the number of intrinsics needed now and in the future in LLVM is dramatically reduced

first, a clarification: Renato, you asked if the shuffle capability of LLVM SVE was sufficient: i replied slightly flippantly asking if shuffle-{any-arith-op} existed as a concept (apologies for that).

SVP64 does not have shuffle-{any-arith-op}; however, being targeted at 3D and Video, it does have Swizzle and a new concept: REMAP. Swizzle can be applied through prefixing to all source registers. it is well-known in the GPU world - especially how important it is - and does not need describing.

REMAP is a completely new concept. an algorithmic "remapping" is applied to the normally sequentially-incrementing Vector Element indices. useful limited easy-to-implement "remappings" are being developed, such as Matrix Schedules (0 3 6 1 4 7 2 5 8) and RADIX-2 FFT/DCT Butterfly Schedules.
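
for illustration (my own sketch), the Matrix Schedule quoted above is
just a re-mapping of sequential element indices into a column-by-column
walk of the matrix:

    def matrix_remap(rows, cols):
        # re-map sequential indices to a transposed traversal
        for c in range(cols):
            for r in range(rows):
                yield r * cols + c

    print(list(matrix_remap(3, 3)))   # [0, 3, 6, 1, 4, 7, 2, 5, 8]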

normally Shuffle is limited to either memory operations or to register MV operations, and both are inherently supported by SVP64 through Vectorisation of base scalar operations: Indexed LD/ST for example.

my point is that whilst SVP64 supports the "normal" expected type of Shuffle Operations expected of Vector ISAs (Vector-Indexed-LD, Indexed-Reg-MV) it also has GPU style Swizzle (a limited type of shuffle for short vectors up to length 4) and REMAP.

thus, there is a case even for adding shuffle-augmentation to base LLVM intrinsics as optional arguments.

the one that *is* much more general purpose but was not mentioned except in passing was VGATHER-VSCATTER.

in all other Vector ISAs these are usually either memory-only or Reg-MV operations (or both), usually done with Predicate Masks. in SVP64, surprise: both VGATHER and VSCATTER are abstracted-out concepts that can apply to almost every operation. this is not possible to do all the time, but when *both* are applied (VGATHER to the source regs or memory, VSCATTER to the dest), we call that "Twin Predication".

thus, again, we would propose adding *both* a source predicate mask *and* destination predicate mask to base llvm intrinsics, as optional arguments.

the other concept is slightly odd: element-width overrides even on operations where the source registers are specified at a fixed width already. this one i am slightly uncertain about.

we have a Mode in SVP64 called "Saturate" which has sub-options Signed and Unsigned. the rules for this took us some time to derive: eventually we realised that the rule has to be that the arithmetic operation appears to take place at *infinite* precision, followed up by truncation to the min/max of the output bitwidth.

all other definitions turned out to be problematic in some way (particularly for multiply or power).
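
that rule is trivial to state in python, because python integers
already behave as if they were infinite-precision (sketch is mine):

    def saturate(value, width, signed):
        # `value` is the infinite-precision arithmetic result;
        # clamp it to the min/max of the output bitwidth
        if signed:
            lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
        else:
            lo, hi = 0, (1 << width) - 1
        return max(lo, min(hi, value))

    print(saturate(200 + 100, 8, signed=False))  # 255, not 44 (wrap)
    print(saturate(100 * 100, 8, signed=True))   # 127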

what i am not certain about is whether it is perfectly sufficient to use standard base LLVM intrinsics, and count on source register type and return type as the SVP64 src width and dest width, and simply add optional arguments for signed/unsigned saturation.

however what is clear to me is that there is very little conceptual limit as to what can be added as optional arguments to base intrinsics. it would be up to ISA Maintainers to define what they can provide in hardware.

i would very much love to hear from other ISA Maintainers as to whether the ISA they are responsible for could benefit from this approach, both in the 3D GPU World as well as standard non-GPU: ARM SVE2, x86, AMDGPU, MIPS, ppc64, SX-Aurora, everyone.

SIMD ISAs would have an optional argument specifying the (fixed) length. Cray-style Scalar Vector ISAs would have an optional argument specifying that the length was variable.

the invitation is therefore to see if this idea, of adding optional Vectorisation Context to base llvm intrinsics, has merit across the entire LLVM community, and, if it does, what would it look like?

key question: what impact would a large number of optional arguments to LLVM base intrinsics have, on performance and memory consumption? would it be beneficial or adverse? i honestly have no idea.

another question: if a given ISA does not provide a particular hardware feature (saturation let us say) then should this be declared in some fashion such that LLVM avoids emitting llvm.add(args, sat=signed)

OR

should the functionality be provided anyway by way of soft-passes behind the scenes? i.e. the lack of hardware saturation would result in IR being emitted that ultimately performed the saturation using multiple assembly operations.

given that this latter approach would effectively imply that *all* LLVM IR backends "supported" SIMD and Vectorisation (emulated through IR passes for non-Vector non-SIMD hardware) it would need some serious thought.

l.