Neat! I did not know that about the V extension. So this sounds as though the V extension would like support for <VL x <4 x float>>-style vectors as well.
Yes. In general, support for <VL x <M x iN>> where M is in {2,4,8} and
N could be as small as 1, though support for elements smaller than i8
is optional. (No distinction is drawn between int and float in the
vector configuration -- that's up to the operations performed.)
We are currently thinking of defining the extension in terms of a 16-bit prefix that changes standard 32-bit instructions into vectorized 48-bit instructions. This would allow most current or future standard/non-standard extensions (such as the B extension) to be vectorized, rather than having to wait for vector versions of them to be added to the V extension (one reason we are not using the V extension instead).
Do you mean instructions following the standard 48-bit encoding
scheme, that happen to contain a standard 32-bit instruction as a
payload?
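For reference, here is a minimal sketch in C of how the standard length encoding distinguishes 16/32/48/64-bit instructions from the low bits of the first 16-bit parcel, as I understand the base ISA spec. The function name rv_insn_length is made up for illustration, and this says nothing about how the proposed prefix would actually be laid out.

```c
#include <stdint.h>

/* Hypothetical helper: returns the instruction length in bytes implied by
 * the first 16-bit parcel, or 0 for longer/reserved encodings that this
 * sketch does not handle. */
static unsigned rv_insn_length(uint16_t first_parcel)
{
    if ((first_parcel & 0x03) != 0x03)
        return 2;   /* bits[1:0] != 11: 16-bit (compressed) */
    if ((first_parcel & 0x1c) != 0x1c)
        return 4;   /* bits[4:2] != 111: standard 32-bit */
    if ((first_parcel & 0x3f) == 0x1f)
        return 6;   /* bits[5:0] == 011111: 48-bit */
    if ((first_parcel & 0x7f) == 0x3f)
        return 8;   /* bits[6:0] == 0111111: 64-bit */
    return 0;       /* longer or reserved encoding */
}
```

So a 48-bit instruction carrying a 32-bit payload would have to spend 6 of its 16 extra bits on the length marker alone, if it follows this scheme.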
Having a prefix rather than, or in addition to, a layout configuration register allows intermixing vector operations on different group/element sizes without having to constantly change the vector configuration every few instructions.
No real difference. The standard RISC-V Vector extension is intended
to allow exactly those changes to the vector configuration every few
instructions. It's mostly the microcontroller people coming from
DSP/SIMD who want to do that, so it's up to them to make that
efficient on their cores -- they might even do macro-op fusion on it.
Big OoO/Supercomputer style code compiled from C/FORTRAN in general
doesn't want to do that kind of thing.
Example code that changes the configuration within a loop to do 16-bit
loads, 16x16->32 multiply, then 32-bit shift and store:
# Example: Load 16-bit values, widen multiply to 32b, shift 32b result
# right by 3, store 32b values.
loop:
vsetvli a3, a0, vsew16,vlmul4 # vtype = 16-bit integer vectors
vlh.v v4, (a1) # Get 16b vector
slli t1, a3, 1
add a1, a1, t1 # Bump pointer
vwmul.vs v8, v4, v1 # 32b in <v8--v15>
vsetvli x0, a0, vsew32,vlmul8 # Operate on 32b values
vsrl.vi v8, v8, 3
vsw.v v8, (a2) # Store vector of 32b
slli t1, a3, 2 # Byte offset: vl * 4 for 32b elements
add a2, a2, t1 # Bump pointer
sub a0, a0, a3 # Decrement count
bnez a0, loop # Any more?
(This example is probably only useful if 16x16->32 mul is
significantly faster than 32x32->32; otherwise you'd just load and
sign-extend the 16-bit data into 32-bit elements.)
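To make the intent of the loop concrete, here is a rough scalar equivalent in C. It is only a sketch under assumptions: VLMAX_ELEMS is a made-up stand-in for whatever vector length the hardware grants via vsetvli, the function name widen_mul_shift is hypothetical, and I am assuming v1 supplies a single 16-bit multiplier.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit on elements per strip; stands in for the hardware's
 * maximum vector length at this configuration. */
#define VLMAX_ELEMS 8

/* Scalar model of the loop: widen-multiply 16-bit inputs by a 16-bit
 * scalar, logically shift the 32-bit product right by 3, store 32 bits. */
static void widen_mul_shift(const int16_t *src, uint32_t *dst,
                            size_t n, int16_t scalar)
{
    while (n != 0) {
        /* vsetvli: handle min(n, VLMAX) elements this iteration */
        size_t vl = n < VLMAX_ELEMS ? n : VLMAX_ELEMS;

        for (size_t i = 0; i < vl; i++) {
            /* vwmul: signed 16x16 -> 32 multiply */
            int32_t wide = (int32_t)src[i] * (int32_t)scalar;
            /* vsrl.vi: logical (zero-filling) shift on the 32-bit value */
            dst[i] = (uint32_t)wide >> 3;
        }

        src += vl;   /* source pointer advances vl * 2 bytes */
        dst += vl;   /* destination pointer advances vl * 4 bytes */
        n -= vl;     /* decrement remaining count */
    }
}
```

Note the two pointers advance by different byte amounts for the same element count, which is why the assembly computes two separate offsets from a3.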
A note on vector register numbering. There are registers 0..31. If you
specify vlmul4 then only v0,v4,v8,v12,v16,v20,v24,v28 are valid
register numbers. If you specify vlmul8 then only v0,v8,v16,v24 are
valid.
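The numbering rule amounts to "the register number must be a multiple of the group size". A small sketch in C, where vreg_valid is a made-up name and the extension of the rule to vlmul1 and vlmul2 is my assumption by analogy with the two cases above:

```c
#include <stdbool.h>

/* Hypothetical check: is vector register number `reg` a legal operand
 * when registers are grouped `lmul` at a time?
 * Assumes lmul is one of 1, 2, 4, 8 and that there are 32 registers. */
static bool vreg_valid(unsigned reg, unsigned lmul)
{
    if (reg >= 32)
        return false;               /* only v0..v31 exist */
    if (lmul != 1 && lmul != 2 && lmul != 4 && lmul != 8)
        return false;               /* unsupported grouping in this sketch */
    return reg % lmul == 0;         /* must name the first register of a group */
}
```

For example, vreg_valid(8, 4) and vreg_valid(8, 8) both hold, while vreg_valid(12, 8) does not, which matches the v8 destination used in the loop above under both configurations.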