RFC: Adding vscale vector types to C and C++

LLVM now supports a "scalable" vector type:

    <vscale x N x ELT> (e.g. <vscale x 4 x i32>)

that represents a vector of X*N ELTs for some runtime value X
[https://reviews.llvm.org/D32530]. The number of elements is therefore
not known at compile time and can depend on choices made by the execution
environment. This RFC is about how we can provide C and C++ types that
map to this LLVM type.
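
As a rough illustration of the element count (a sketch, not part of the RFC): the helper names below are hypothetical, and the mapping from vector length to vscale assumes the AArch64 SVE convention, where <vscale x 4 x i32> fills one vector register whose length is a multiple of 128 bits.

```cpp
#include <cstdint>

// Hypothetical helper: number of lanes in <vscale x N x ELT> for a given
// runtime vscale value.
constexpr std::uint64_t lane_count(std::uint64_t vscale, std::uint64_t n) {
    return vscale * n;
}

// Assumption: SVE-style mapping, where vscale = vector length in bits / 128.
constexpr std::uint64_t vscale_for_bits(std::uint64_t vector_bits) {
    return vector_bits / 128;
}

// The same type <vscale x 4 x i32> holds a different number of i32 lanes
// depending on the vector length chosen by the execution environment.
static_assert(lane_count(vscale_for_bits(128), 4) == 4, "128-bit vectors");
static_assert(lane_count(vscale_for_bits(512), 4) == 16, "512-bit vectors");
```

In real code the value is only available at run time; the constexpr form here is just a way to check the arithmetic.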

The main complication is that, because the number of elements isn't
known at compile time, "sizeof" can't work in the same way as it does
for normal vector types. Our suggested fix for this is to separate the
concept of "complete type" into two:

* does the type have enough information to construct objects of that type?

    For want of a better term, types that have this property are
    "definite" while types that don't are "indefinite".

* will it be possible to measure the size of the type using "sizeof",
  once the type is definite?

    If so, the type is "sized", otherwise it is "sizeless".

"Complete" is then equivalent to "sized and definite". The new scalable
vectors are definite but sizeless, and so are never complete.

We can then redefine certain rules to use the distinction between
definite and indefinite types rather than complete and incomplete types.
(This is a simple change to make in Clang.) Things like "sizeof" and
pointer arithmetic continue to require complete types, and so are invalid
for the new types. See below for a more detailed description.
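
The split described above can be modelled as a small truth table. This is only a sketch of the terminology: TypeProps and the helper functions are hypothetical names, not Clang APIs, and classifying a forward-declared struct as "indefinite but sized" is an assumption based on the definitions given here.

```cpp
// Hypothetical model of the proposed classification.
struct TypeProps {
    bool definite;  // enough information to construct objects of the type?
    bool sized;     // measurable with sizeof once the type is definite?
};

// "Complete" is equivalent to "sized and definite".
constexpr bool is_complete(TypeProps t) { return t.definite && t.sized; }

// Rules that only need to create objects would switch to "definite"...
constexpr bool can_declare_variable(TypeProps t) { return t.definite; }
// ...while sizeof and pointer arithmetic keep requiring "complete".
constexpr bool can_apply_sizeof(TypeProps t) { return is_complete(t); }
constexpr bool can_do_pointer_arithmetic(TypeProps t) { return is_complete(t); }

constexpr TypeProps kInt{true, true};
constexpr TypeProps kForwardDeclaredStruct{false, true};  // assumption
constexpr TypeProps kScalableVector{true, false};

static_assert(is_complete(kInt), "ordinary types stay complete");
static_assert(!can_declare_variable(kForwardDeclaredStruct), "indefinite");
static_assert(can_declare_variable(kScalableVector), "definite");
static_assert(!is_complete(kScalableVector), "sizeless, so never complete");
```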

We're also proposing to treat the new C and C++ types as opaque built-in
types rather than first-class vector types, for two reasons:

(1) It means that we don't need to define what the "vscale" is for
    all targets, or emulate general vscale operations for all targets.
    We can just provide the types that the target supports natively,
    and for which the target already has a defined ABI.

(2) It allows for more abstraction. For example, SVE has scalable types
    that are logically tuples of 2, 3 or 4 vectors. Defining them as opaque
    built-in types means that we don't need to treat them as single vectors
    in C and C++, even if that happens to be how LLVM represents them.
    Building tuple types into the compiler also means that we don't need
    to support scalable vectors in structures or arrays.

In case this looks familiar...

Hi Richard,

I appreciate your effort on this RFC.

> Using intrinsics might seem old-fashioned when there are various
> frameworks that express data-parallel algorithms in a more abstract way,
> or libraries like P0214 (std::simd) that provide mostly performance-
> portable vector interfaces.  But in practice, each vector architecture
> has its own quirks and unique features that aren't easy for the compiler
> to use automatically and aren't performance-portable enough to be part
> of a generic interface.  So even though target-neutral approaches are a
> very welcome development, they're not a complete solution.  Intrinsics
> are still vital when you really want to hand-optimise a routine for a
> particular architecture.  And that's still a common requirement.
>
> For example, Arm has been porting various codebases that already support
> AArch64 AdvSIMD intrinsics to SVE2.  Even though AdvSIMD and SVE2 have
> some features in common, the routines for the two architectures are
> often significantly different from each other (and in ways that can't be
> abstracted by interfaces like std::simd).  We need to have direct access
> to SVE2 features for this kind of work.

I am +1000 for using intrinsic functions. Internally, there was some
discussion about supporting this type. For instance, how can we implement
vector swizzles like ".xyz" or "hi/lo"? At the moment, Clang uses
shufflevector to implement them. I guess we would want to swizzle the
vector per vector unit, which is unknown at compile time. I am not sure
we can implement that efficiently with LLVM's current IR vector
operations. We could miss instruction combining or other optimization
opportunities. However, I guess it would not be easy for the passes to
handle operations on this type. If I have missed something, please let
me know.

Thanks,
JinGu Kang

> These are all negative reasons for (1) being the best approach.
> A more positive justification is that (1) seems to meet the requirements
> in the most efficient way possible.  The vectors can use their natural
> (native) representation, and the type system prevents uses that would
> make that representation problematic.
>
> Also, the approach of starting with very restricted types and then
> specifically allowing certain things should be more future-proof
> and interact better with other (unseen) language extensions.  By default,
> any language extension would treat the new types like other incomplete
> types and choose conservatively-correct behavior.  It would then be
> possible to relax the rules if this default behavior turns out to be
> too restrictive.

I wondered whether we can define the restrictions on this type well in
terms of the C/C++ standards. In my opinion, this RFC proposes a clear
concept. I am +1000 for (1). Additionally, I agree with your opinion on
variable length arrays. One thing I would like to mention is that we
could use 'dynamic_alloc' for local variables of this type, as for other
variable-sized objects. If we use it, I guess we don't need to change
the backend code.

Thanks,
JinGu Kang

Hi,

Thanks for the reply and Phabricator review.

JinGu Kang <jingu@codeplay.com> writes:

>> Using intrinsics might seem old-fashioned when there are various
>> frameworks that express data-parallel algorithms in a more abstract way,
>> or libraries like P0214 (std::simd) that provide mostly performance-
>> portable vector interfaces. But in practice, each vector architecture
>> has its own quirks and unique features that aren't easy for the compiler
>> to use automatically and aren't performance-portable enough to be part
>> of a generic interface. So even though target-neutral approaches are a
>> very welcome development, they're not a complete solution. Intrinsics
>> are still vital when you really want to hand-optimise a routine for a
>> particular architecture. And that's still a common requirement.
>>
>> For example, Arm has been porting various codebases that already support
>> AArch64 AdvSIMD intrinsics to SVE2. Even though AdvSIMD and SVE2 have
>> some features in common, the routines for the two architectures are
>> often significantly different from each other (and in ways that can't be
>> abstracted by interfaces like std::simd). We need to have direct access
>> to SVE2 features for this kind of work.

> I am +1000 for using intrinsic functions. Internally, there was some
> discussion about supporting this type. For instance, how can we implement
> vector swizzles like ".xyz" or "hi/lo"? At the moment, Clang uses
> shufflevector to implement them. I guess we would want to swizzle the
> vector per vector unit, which is unknown at compile time. I am not sure
> we can implement that efficiently with LLVM's current IR vector
> operations. We could miss instruction combining or other optimization
> opportunities. However, I guess it would not be easy for the passes to
> handle operations on this type. If I have missed something, please let
> me know.

Yeah, the initial vscale patch that was applied to the LLVM repo only
supported two kinds of index mask for shufflevectors: zeroinitializer
or undef. Arm's internal implementation supports much more than that,
but this is still an area that needs to be agreed with the community.

In general, the Clang implementation of the SVE built-in functions
uses a combination of generic IR operations and target-specific LLVM
intrinsics. At the moment the permute-like functions use intrinsics.
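
To make the zeroinitializer case concrete: a shufflevector whose mask is all zeros reads lane 0 of the first operand for every result lane, so the mask never has to enumerate lanes and works for any runtime length. The following is a minimal sketch of that semantics only; it models a scalable vector as a std::vector whose length is fixed at run time, and shuffle_zero_mask is a hypothetical name, not an LLVM or Clang API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: every result lane selects lane 0 of the input, which
// is what an all-zero shuffle mask means. The result has as many lanes as
// the input, whatever that number turns out to be at run time.
std::vector<int> shuffle_zero_mask(const std::vector<int>& a) {
    assert(!a.empty() && "a vector always has at least one lane");
    return std::vector<int>(a.size(), a[0]);
}
```

An undef mask, the other supported case, places no requirement on the result lanes at all, so it needs no model.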

Thanks,
Richard

JinGu Kang <jingu@codeplay.com> writes:

>> These are all negative reasons for (1) being the best approach.
>> A more positive justification is that (1) seems to meet the requirements
>> in the most efficient way possible. The vectors can use their natural
>> (native) representation, and the type system prevents uses that would
>> make that representation problematic.
>>
>> Also, the approach of starting with very restricted types and then
>> specifically allowing certain things should be more future-proof
>> and interact better with other (unseen) language extensions. By default,
>> any language extension would treat the new types like other incomplete
>> types and choose conservatively-correct behavior. It would then be
>> possible to relax the rules if this default behavior turns out to be
>> too restrictive.

> I wondered whether we can define the restrictions on this type well in
> terms of the C/C++ standards. In my opinion, this RFC proposes a clear
> concept. I am +1000 for (1). Additionally, I agree with your opinion on
> variable length arrays. One thing I would like to mention is that we
> could use 'dynamic_alloc' for local variables of this type, as for other
> variable-sized objects. If we use it, I guess we don't need to change
> the backend code.

Yeah, dynamic allocation works well for local variables, and for example
we'd emit the equivalent of:

    %var = alloca <vscale x 4 x i32>

for:

    svint32_t var;

But having the vscale LLVM type is important for function interfaces,
where we need to be able to pass and return vectors by value.
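
A sketch of what such code could look like at the source level, assuming the ACLE <arm_sve.h> intrinsics and a compiler that defines __ARM_FEATURE_SVE. The function names add_one and increment are hypothetical, and the scalar fallback is only there so that the example also builds on targets without SVE.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

// A scalable vector is passed and returned by value; nothing here needs
// sizeof(svint32_t).
static svint32_t add_one(svbool_t pg, svint32_t v) {
    return svadd_n_s32_x(pg, v, 1);
}

void increment(std::int32_t* data, std::int64_t n) {
    assert(n >= 0);
    // svcntw() gives the number of 32-bit lanes, known only at run time.
    for (std::int64_t i = 0; i < n; i += static_cast<std::int64_t>(svcntw())) {
        svbool_t pg = svwhilelt_b32(i, n);        // lanes still inside [i, n)
        svint32_t var = svld1_s32(pg, data + i);  // local of a sizeless type
        svst1_s32(pg, data + i, add_one(pg, var));
    }
}
#else
// Scalar fallback (an addition for portability, not part of the proposal).
void increment(std::int32_t* data, std::int64_t n) {
    assert(n >= 0);
    for (std::int64_t i = 0; i < n; ++i) data[i] += 1;
}
#endif
```

The local variable var corresponds to the alloca shown above, while add_one relies on the scalable type appearing directly in the function signature.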

Thanks,
Richard

Hi Richard,

> But having the vscale LLVM type is important for function interfaces,
> where we need to be able to pass and return vectors by value.

I guess we can use 'byval' for parameter passing and 'sret' for the
return value. That means we can use the stack for the ABI, and the stack
space could be allocated with a dynamic alloc. As you know, Clang has a
target ABI interface, and I guess we can generate the attributes through
that interface.

Regards,
JinGu Kang

Ping :)

Richard Sandiford <richard.sandiford@arm.com> writes:

Ping*2.

Quick summary of the patches backing the RFC:

* https://reviews.llvm.org/D62960
  Add the SVE types themselves. Thanks for the reviews on this one!

* https://reviews.llvm.org/D62961
  [AST] Add type queries for scalable types.

* https://reviews.llvm.org/D62962
  Main patch, including documentation & tests. Mostly affects Sema.

Richard Sandiford <richard.sandiford@arm.com> writes: