[EXT] Re: [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

JinGu:

I’m not Graham, but you might find the following link a good starting point.

https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture

The question you ask doesn’t have a short answer. The compiler and the instruction set design work together to allow programs to be compiled without knowing until run-time what the vector width is (within limits of min and max possible widths). One key restriction is that certain storage classes can’t contain scalable vector types, like statically allocated globals for example.
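To illustrate that restriction, here is a sketch in the IR syntax proposed in the RFC (the <vscale x 4 x i32> spelling is the one from the SVE proposal; treat the exact details as illustrative):

  ; A fixed-width vector global is fine: its size (16 bytes) is
  ; known at compile time.
  @fixed = global <4 x i32> zeroinitializer

  ; A scalable vector has no compile-time size, so it cannot be a
  ; statically allocated global:
  ; @scalable = global <vscale x 4 x i32> zeroinitializer   ; invalid

  ; As an SSA value or function argument it is fine, because the
  ; stack frame can be sized at run time:
  define <vscale x 4 x i32> @id(<vscale x 4 x i32> %v) {
    ret <vscale x 4 x i32> %v
  }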

Joel Jones

Hi Joel,

Thanks for your kind guidance.

> https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture

I will have a look at the post.

> One key restriction is that certain storage classes can’t contain scalable vector types, like statically allocated globals for example.

That was one of the things I wanted to know.

Thanks,
JinGu Kang

Hi All,

I have read the links from Joel. It seems one of their main focuses is loop vectorization with the vector predicate register. I am not sure we need the scalable vector type for that. Let’s look at a simple example from the white paper.

void example01(int *restrict a, const int *b, const int *c, long N)
{
  long i;
  for (i = 0; i < N; ++i)
    a[i] = b[i] + c[i];
}

We could roughly imagine the vectorized loop with a mask at the IR level as below.

header:
  %n.broadcast.splatinsert = insertelement <8 x i32> undef, i32 %n, i32 0
  %n.vec = shufflevector <8 x i32> %n.broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
  ; initial mask: lane k is active iff k < %n (handles %n < 8)
  %mask.init = icmp slt <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, %n.vec
  br label %loop.body

loop.body:
  %index = phi i32 [ 0, %header ], [ %index.next, %loop.body ]
  %mask.vec = phi <8 x i1> [ %mask.init, %header ], [ %mask.vec.next, %loop.body ]
  %a.addr = getelementptr inbounds i32, i32* %a, i32 %index
  %b.addr = getelementptr inbounds i32, i32* %b, i32 %index
  %c.addr = getelementptr inbounds i32, i32* %c, i32 %index
  %a.vec.addr = bitcast i32* %a.addr to <8 x i32>*
  %b.vec.addr = bitcast i32* %b.addr to <8 x i32>*
  %c.vec.addr = bitcast i32* %c.addr to <8 x i32>*
  %b.val = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* %b.vec.addr, i32 4, <8 x i1> %mask.vec, <8 x i32> undef)
  %c.val = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* %c.vec.addr, i32 4, <8 x i1> %mask.vec, <8 x i32> undef)
  %a.val = add <8 x i32> %b.val, %c.val
  call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> %a.val, <8 x i32>* %a.vec.addr, i32 4, <8 x i1> %mask.vec)
  %index.next = add i32 %index, 8
  ; mask for the next iteration: lane k is active iff %index.next + k < %n
  %index.broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index.next, i32 0
  %index.next.splat = shufflevector <8 x i32> %index.broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
  %index.next.vec = add <8 x i32> %index.next.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %lane.cond.vec = icmp slt <8 x i32> %index.next.vec, %n.vec
  %mask.vec.next = and <8 x i1> %lane.cond.vec, %mask.vec
  %cond = icmp sge i32 %index.next, %n
  br i1 %cond, label %loop.exit, label %loop.body

loop.exit:

The vectorized loop above does not need a tail loop. I guess we could map %mask.vec to the predicate register as a native register class at the ISelLowering level. The conditional branch could also be mapped to ‘whilexx’ and ‘b.xxx’ at the MIR level. To choose the vector type, we could compute a cost model for the target, as LLVM’s vectorizers do. If SVE focuses mainly on loop vectorization, I am not sure why the scalable vector type is needed… In my personal opinion, the VLA programming model could add ambiguity and complexity to the compiler because it is not a concrete type at compile time… I am not an expert on SVE and VLA, so I could be missing something important. If I missed something, please let me know.

Thanks,
JinGu Kang

Hi JinGu,

> The vectorized loop above does not need a tail loop. I guess we could map %mask.vec to the predicate register as a native register class at the ISelLowering level. The conditional branch could also be mapped to 'whilexx' and 'b.xxx' at the MIR level. To choose the vector type, we could compute a cost model for the target, as LLVM's vectorizers do. If SVE focuses mainly on loop vectorization, I am not sure why the scalable vector type is needed... In my personal opinion, the VLA programming model could add ambiguity and complexity to the compiler because it is not a concrete type at compile time... I am not an expert on SVE and VLA, so I could be missing something important. If I missed something, please let me know.

SVE doesn't have a prescribed vector length -- the size of a vector register is hardware dependent, and a single compiled VLA loop will execute correctly on hardware with different vector lengths. So if you ran the same code on a CPU with 128b SVE vectors and another with 256b vectors, you have the potential to double the work done per cycle on the second CPU without changing the code (there are lots of factors that could prevent performance from scaling nicely, but those aren't directly related to the vector length). SVE's current maximum defined size is 2048b, though I suspect it'll be quite a while before we see vectors of that size in commodity hardware. Fujitsu's A64FX will use 512b vectors.
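To make the VLA idea concrete, here is a hedged sketch of the loop body from earlier in the thread, rewritten with the scalable vector type from the RFC. The number of lanes in <vscale x 4 x i32> is 4 * vscale, where vscale is a run-time constant fixed by the hardware, so one binary covers all vector lengths; the exact intrinsic suffixes are an assumption on my part:

  ; vscale = 1 on 128b SVE hardware, 4 on 512b (e.g. A64FX),
  ; up to 16 on 2048b -- the IR is identical in every case.
loop.body:
  ; %mask would come from a whilelo-style predicate computation
  %b.val = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32(<vscale x 4 x i32>* %b.vec.addr, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x i32> undef)
  %c.val = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32(<vscale x 4 x i32>* %c.vec.addr, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x i32> undef)
  %a.val = add <vscale x 4 x i32> %b.val, %c.val
  call void @llvm.masked.store.nxv4i32(<vscale x 4 x i32> %a.val, <vscale x 4 x i32>* %a.vec.addr, i32 4, <vscale x 4 x i1> %mask)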

We used predication in the example to show a loop without a scalar tail, but it's not necessary to use the ISA in that manner.

The RISC-V V extension is similar, though it has a few extra bits to worry about. You'll need to ask Robin Kruppe if you want more details on that.

As far as your question on memory layout is concerned, we don't expect many base 'load' or 'store' instructions to be used for scalable vector types, and for an extremely conservative approach you could consider such a memory operation to potentially alias all memory in a given address space. Instead, we expect to always use masked load and store intrinsics (or gather/scatter intrinsics), and we will need to improve parts of AA to take the masks into account.
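As a sketch of that last point (the intrinsic signature here is an assumption, not a final design):

  ; A masked load only touches lanes where the predicate is true, so
  ; an alias analysis that understands masks could prove that a store
  ; to the tail of the buffer cannot clobber the lanes this load reads.
  %v = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32(<vscale x 4 x i32>* %p, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x i32> undef)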

-Graham

Hi Graham,

Thanks for your kind explanation.

There was internal discussion about it. If possible, could you point me to the Clang/LLVM CodeGen patches for the vector type on Phabricator, please? I would like to check what kinds of restrictions the type causes on Clang/LLVM.

Thanks,
JinGu Kang

Hi JinGu,

> Hi Graham,
>
> Thanks for your kind explanation.
>
> There was internal discussion about it. If possible, can you let me know the Clang/LLVM CodeGen patches for the vector type on phabricator please? I would like to check what kinds of the restrictions the type causes on Clang/LLVM.

There's a list of 14 patches at the bottom of the RFC which provided a simple demonstration of the type being used, but some are now obsolete. The first patch in that series (https://reviews.llvm.org/D32530) has now been merged, and there are plenty of comments on the review which relate to restrictions on the type. For the rest of those patches, my team is now preparing more complete implementations of SVE codegen, but none have landed on Phabricator yet.

There's also https://reviews.llvm.org/D53137 which may be of interest -- that shows an extension to LLVM to measure the size of types which might be scalable so that we don't confuse them with non-scalable types. I'm working on an extended version of that at the moment, which will introduce a few new helper functions to abstract away from plain sizes.

Hope that helps a little,

-Graham

Hi Graham,

I appreciate your kind guidance.

I have read the reviews. If possible, can I ask how VLA supports the scalable vector type in Clang for C/C++? Do you have a similar approach to the IR type in the AST? Additionally, I would like to know how you will support legalization for the scalable vector type and its vector operations like shufflevector, insertelement, and extractelement, especially without a predicate vector register. If I am asking the wrong questions, I am sorry for that.

Thanks,
JinGu Kang