New intrinsic to get number of lanes in SIMD vectors

tl;dr:
I’m wondering whether it makes sense to have something like a new @llvm.vector.numelements() intrinsic function, which returns the number of elements in a SIMD vector either at compile time if known (fixed-sized vectors like x86, NEON) or at runtime (variable-sized vectors like SVE, RISC-V V).

Long version:
I’ve been playing around with LLVM’s vector types (via vector_size(n)) for cross-platform SIMD code. One issue that came up is the distinction between fixed- and variable-sized vectors. Let’s say I want to have a simple loop to add two arrays with SIMD instructions, along the lines of this C++ code:

// Can be fixed-sized (e.g., 16-byte x86 register)
using VecT __attribute__((vector_size(16))) = int;
// OR variable-sized vector (e.g., SVE or RISC-V V)
using VecT = svuint32_t;

// Generic loop on both platforms
// Generic loop on both platforms (N is the number of elements to process)
void add_vectors(int* a, int* b, int* c) {
  for (int i = 0; i < N; i += ???) { // <-- how to increment i?
    *(VecT*)&c[i] = (VecT&)a[i] + (VecT&)b[i];
  }
}

Godbolt link for example. There is a bit of loop unrolling in the x86 output; the SVE assembly is a bit easier to read.

The main issue with making this work for both fixed- and variable-sized vectors is that the loop increment is only known at compile time for fixed-size vectors, e.g., via sizeof(VecT) / sizeof(int). For scalable vectors, it is only known at runtime, so we need a way to get the number of vector elements (or lanes) at runtime. Other than that, the code works on, e.g., x86 and SVE, as shown in the godbolt link.

We could write a num_lanes() method for each platform (as shown in godbolt) that does this for us. I was wondering if it makes sense to have an LLVM intrinsic that does this for us instead. Something like @llvm.vector.numelements, which can be wrapped in Clang with something like __builtin_vectorelements() (similar naming to __builtin_convertvector()). With both fixed and scalable vectors, programming for both will become more and more common, so maybe it makes sense to provide a method for users to avoid writing a num_lanes() method on their end for each platform and vector type. All the information needed is available in LLVM, so it feels unnecessary to duplicate all the logic in user code again. For fixed-sized vectors, this is rather trivial to implement, and for scalable vectors, I guess we would need to find the right call depending on the size of the vector elements.
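For reference, a per-platform num_lanes() helper along the lines of the godbolt example might look roughly like this. The helper name and the SVE branch are illustrative, not an existing API:

```cpp
#include <cstddef>

// Fixed-size 16-byte vector of int (4 lanes on typical targets).
using FixedVecT __attribute__((vector_size(16))) = int;

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
// Scalable vectors: the lane count is only known at runtime.
inline std::size_t num_lanes() { return svcntw(); } // 32-bit lanes per vector
#else
// Fixed-size vectors: the lane count is a compile-time constant.
constexpr std::size_t num_lanes() {
  return sizeof(FixedVecT) / sizeof(int);
}
#endif
```

This is exactly the kind of per-platform boilerplate that a single builtin could replace.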

Does this make sense at all? Is an intrinsic function the right thing to use here? Or is there maybe already a way to express this? Maybe we only need a Clang wrapper here instead of an LLVM intrinsic?

I’m happy to discuss some ideas/thoughts and I’m also willing to implement this if there is interest in it.

Best,
Lawrence

I think this is a great idea!

How would you pass the element type to the intrinsic? I think that __builtin_convertvector() does not emit intrinsics.

Glad to hear there is interest in this. I haven’t thought about the C++ side too much yet. But I guess a builtin could take either a vector instance or a vector type. __builtin_convertvector actually takes both, i.e., the input vector and the target type, so both are possible and supported. It probably makes more sense to take the type, though, as the lane count is a property of the type, not of an instance.
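To illustrate, a type-based builtin could be used much like sizeof. This sketch guards on __has_builtin since __builtin_vectorelements is only a proposal here and won’t exist on most compilers:

```cpp
#include <cstddef>

using VecT __attribute__((vector_size(16))) = int;

#ifdef __has_builtin
#  if __has_builtin(__builtin_vectorelements)
#    define HAVE_VECTORELEMENTS 1
#  endif
#endif

std::size_t lanes() {
#ifdef HAVE_VECTORELEMENTS
  // Proposed builtin: takes a type, analogous to sizeof.
  return __builtin_vectorelements(VecT);
#else
  // Fallback for fixed-size vectors only.
  return sizeof(VecT) / sizeof(int);
#endif
}
```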

To be clear, by intrinsic I meant the LLVM IR intrinsic, not the C intrinsic (overloaded word).

An LLVM intrinsic is not necessary or desirable for this. The number of lanes is either a constant or a vscale intrinsic multiplied by a constant. The clang builtin can generate this directly.
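Concretely: for a fixed <4 x i32> the lane count is just the constant 4, while for a scalable <vscale x 4 x i32> the builtin could directly emit the existing llvm.vscale intrinsic multiplied by the minimum lane count, along the lines of:

```llvm
; Lane count of <vscale x 4 x i32>: vscale * 4, no new intrinsic needed.
%vscale = call i64 @llvm.vscale.i64()
%lanes  = mul i64 %vscale, 4
```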

@nikic Thanks for clarifying. This is exactly the part I was unsure about. I’ll re-phrase the question and post it in the Clang topic to get some feedback on this from a C/C++ frontend view.

That works, just please don’t hard-code any more target knowledge into clang. This info needs to come from the backend.