Hi all,
Thanks Renato for the prod.
We (Arm) have had more off-line discussions with some members of the
community, and they have expressed some reservations about adding scalable
vectors as a first-class type. They have proposed an alternative that would
enable support for C-level intrinsics and autovectorization for SVE.
While Arm's preference is still to support VLA autovec in LLVM (and not just
for SVE; we'll continue the discussion around the RFC), we are evaluating the
details of this alternative -- SVE-capable hardware will begin shipping within
the next couple of years, so we would like to support at least some
autovectorization as well as the C intrinsics by the time that happens.
This alternative proposal has two parts:
* For the SVE ACLE (C-language extension intrinsics), use an opaque type
(similar to x86_mmx, but unsized) and just pass intrinsics straight
through to the backend. This would give users the ability to write
vector length agnostic (VLA) code for SVE without resorting to assembly
(see the sketch after this list).
* For SVE autovectorization, use fixed length autovec -- either for a
user-specified length, or multiversioned to different fixed lengths.
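To make the first part concrete, here is a rough sketch of the kind of VLA
C code the ACLE intrinsics would let users write. It's only an illustration,
using intrinsic names from the ACLE spec rather than the small set
implemented in the patches below:

#include <arm_sve.h>

/* y[i] += a * x[i], without knowing the vector length at compile time.
   svcntd() gives the number of doubles per vector register, and
   svwhilelt_b64 builds the predicate covering the final partial
   iteration. */
void daxpy(double *y, const double *x, double a, int64_t n) {
  for (int64_t i = 0; i < n; i += svcntd()) {
    svbool_t pg = svwhilelt_b64(i, n);
    svfloat64_t vx = svld1_f64(pg, &x[i]);
    svfloat64_t vy = svld1_f64(pg, &y[i]);
    vy = svmla_n_f64_x(pg, vy, vx, a);   /* vy + vx * a */
    svst1_f64(pg, &y[i], vy);
  }
}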
I've spent some time over the last month prototyping an opaque type SVE C
intrinsic implementation to see what it would look like; here are my notes
so far:
* I initially tried to use a single unsized opaque type.
* I ran into problems with using just a single type, since predicates use
different registers and I couldn't find a nice way of reassigning all
the register classes cleanly.
- As a result, I added a second opaque type to represent predicates
- This could be avoided if we added subtype info to the opaque type
(basically a minimum element count and an element type); this would mean
either representing the count and element type in a serialized IR form,
or having the IR reader reconstruct the types from the intrinsic names
* I ran into a problem with the opaque types being unsized -- the C
intrinsic variables are declared as locals, and clang plants
alloca/load/store IR instructions for them (see the example after these
notes)
- Could special-case alloca/load/store for these types, but that's very
intrusive and liable to break in future
- Could introduce a special 'alloca intrinsic', but that would require
quite a bit of clang code to diverge, as well as a custom mem2reg-like
pass just for these types
- I ended up making them sized, but with a size of 0. I don't know if
there's a problem I'll run into later on by doing this.
- While 'load' and 'store' IR instructions are fine for spill/fill memory
operations on the stack, we need to use intrinsics for all other memory
accesses, since those need to know the size of individual elements --
there may not be many big-endian systems in operation, but we still need
to support them.
* I reused the same (clang-level) builtin type mechanism that OpenCL uses
for the SVE C-level types, and just codegen them to the two LLVM types
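To show where the alloca/load/store issue comes from, here's a small
illustrative example (again using ACLE spec names): any local variable of
an ACLE vector or predicate type initially gets a stack slot from clang, so
the underlying IR types need to tolerate alloca plus plain load/store for
spill/fill, even though accesses to user data go through the ld1/st1
intrinsics.

#include <arm_sve.h>

/* 'acc' is a local of an ACLE vector type that is live across the loop;
   before mem2reg runs, clang gives it a stack slot (an alloca) and
   accesses it with plain load/store. The reads of x[] go through
   svld1_f64, which knows the element size and so also works for
   big-endian. */
double sum(const double *x, int64_t n) {
  svfloat64_t acc = svdup_n_f64(0.0);
  for (int64_t i = 0; i < n; i += svcntd()) {
    svbool_t pg = svwhilelt_b64(i, n);
    acc = svadd_f64_m(pg, acc, svld1_f64(pg, &x[i]));
  }
  return svaddv_f64(svptrue_b64(), acc);
}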
I now have a minimal end-to-end implementation for a small set of SVE C
intrinsics. I have some additional observations based on our downstream
intrinsic implementation:
* Our initial downstream implementation attempted to do everything with
intrinsics, so it would be similar to the opaque type version. However,
we found that we missed several optimizations in the process. Part of
this is due to the intrinsics being higher-level than the instructions
-- things like addressing modes are not represented in the intrinsics,
so with a pure intrinsic approach we miss things like LSR optimizations.
* We also thought that the need for custom extensions to optimizations
like instcombine for the SVE intrinsics would be reduced, since someone
using the intrinsics is already going to the trouble of hand-optimizing
their code. We hadn't appreciated how common it would be for the
intrinsics to be used from C++ templates with constant parameters and
other forms of code generation. As a result, we now have user requests
that operations like 'svmul(X, 1.0)' be recognized and folded away (see
the example after this list), and we are trying to find better
representations, including lowering to normal IR operations in some cases.
* Some operations can't be represented cleanly in current IR, but should
work well with Simon Moll's vector predication proposal.
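For reference, the svmul(X, 1.0) requests come from patterns like the one
below. The real cases use C++ templates; this is just a C analogue with the
scale factor as a constant parameter. Once the constant is inlined the
multiply is a no-op, but with a pure intrinsic representation there's
nothing instcombine can fold.

#include <arm_sve.h>

/* Generic helper with a scale factor that is a compile-time constant
   at the call site (a template parameter in the real C++ code). */
static inline svfloat64_t scale(svbool_t pg, svfloat64_t v, double factor) {
  return svmul_n_f64_x(pg, v, factor);
}

void scale_in_place(double *x, int64_t n) {
  for (int64_t i = 0; i < n; i += svcntd()) {
    svbool_t pg = svwhilelt_b64(i, n);
    svfloat64_t v = svld1_f64(pg, &x[i]);
    /* After inlining, 'factor' is the constant 1.0, so this multiply
       does nothing -- but as an opaque intrinsic call it never gets
       folded away. */
    svst1_f64(pg, &x[i], scale(pg, v, 1.0));
  }
}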
Any feedback? I've posted my (very rough) initial work to phabricator:
clang: https://reviews.llvm.org/D59245
llvm: https://reviews.llvm.org/D59246
-Graham