On large vectors

I have a simple expression-evaluation language using LLVM (it's called
Calculon, if anyone's interested). It has pretty primitive support for
3-vectors, which I'm representing as a <3 x float>.
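
For anyone unfamiliar with the notation, an operation on such a vector is
a single IR instruction, something like this (the operand names are just
placeholders):

    %sum = fadd <3 x float> %a, %b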

One of my users has asked for proper n-vector support, and I agree with
him so I'm adding that. However, he wants to use quite large vectors.
He's mentioned 30 elements, and I think he would like 50.

At what point do vectors stop being useful?

I've experimented with llc and, while large vectors *work*, the code
that's produced looks pretty scary (and I don't know enough about SSE
and AVX instructions to evaluate whether it's good or not). I can think
of various things I could do: passing input parameters by reference
instead of value, ditto with output parameters, falling back to
aggregates or plain arrays... but all this adds complexity to the code
generator, and I don't know if it's worth it (or at what point it
*becomes* worth it).
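
To make the alternatives concrete, the choice is roughly between
signatures like these two (the names are invented for illustration):

    ; by value: the whole vector travels through registers/stack
    declare <50 x float> @op_byval(<50 x float>)

    ; by reference: only pointers are passed, the data stays in memory
    declare void @op_byref([50 x float]*, [50 x float]*)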

My JITted code consists of a single module with a single entry point,
and is all compiled up front with aggressive inlining. Vector parameters
are only ever used internally. I've noticed that LLVM does an excellent
job of optimising. Can I just use the naive approach and trust LLVM to
get on with it and make it work?

I can see why freakishly large vectors would produce bad code. The type
<50 x float> would be widened to the next power of two, and then split
over and over again until it fits into registers. So any <50 x float>
operation would take 16 XMM registers, which would then be spilled. The
situation with integer types is even worse, because you can truncate or
extend from one type to another.
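
To spell out the arithmetic (a back-of-the-envelope sketch, assuming
plain SSE):

    ; one operation on the illegal type
    %r = fadd <50 x float> %a, %b
    ; legalization widens <50 x float> to <64 x float>, then splits it
    ; into 64/4 = 16 pieces of <4 x float>, i.e. one fadd per XMM
    ; register; with only 16 XMM registers on the machine, operands and
    ; results end up spilled to the stack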

In that sense, an inner loop with sequential access would be vectorized
into much better code than having a <50 x float>.
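
For example, a plain scalar loop like this (a rough sketch of a
50-element add) is the kind of shape the loop vectorizer is designed to
widen to whatever the target actually supports:

    define void @vadd(float* %a, float* %b, float* %c) {
    entry:
      br label %loop

    loop:
      %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
      %pa = getelementptr float* %a, i64 %i
      %pb = getelementptr float* %b, i64 %i
      %pc = getelementptr float* %c, i64 %i
      %va = load float* %pa
      %vb = load float* %pb
      %sum = fadd float %va, %vb
      store float %sum, float* %pc
      %i.next = add i64 %i, 1
      %done = icmp eq i64 %i.next, 50
      br i1 %done, label %exit, label %loop

    exit:
      ret void
    }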

Whether this is something LLVM could do with <50 x float> or should
always be up to the front-end developer, I don't know. It doesn't seem
particularly hard to do in the vectorizer, but it also probably won't
be high on the TODO list for a while.

cheers,
--renato

Renato Golin wrote:
[...]

    I can see why freakishly large vectors would produce bad code. The
    type <50 x float> would be widened to the next power of two, and
    then split over and over again until it fits into registers. So any
    <50 x float> operation would take 16 XMM registers, which would
    then be spilled. The situation with integer types is even worse,
    because you can truncate or extend from one type to another.

    In that sense, an inner loop with sequential access would be
    vectorized into much better code than having a <50 x float>.

    Whether this is something LLVM could do with <50 x float> or should
    always be up to the front-end developer, I don't know. It doesn't
    seem particularly hard to do in the vectorizer, but it also
    probably won't be high on the TODO list for a while.

I have actually been reading up on the vectorizer. I'm using LLVM 3.2,
so the vectorizer isn't turned on by default. Would it be feasible to
explicitly *not* use vectors --- switching to aggregates instead --- and
then rely on the vectorizer to autovectorize the code where appropriate?

    I have actually been reading up on the vectorizer. I'm using LLVM
    3.2, so the vectorizer isn't turned on by default.

Not just that, but there is also a lot more coverage since last release
(including floating points).

    Would it be feasible to explicitly *not* use vectors --- switching
    to aggregates instead --- and then rely on the vectorizer to
    autovectorize the code where appropriate?

It depends. If you use vectors that are within the boundaries of the
target's vector sizes, then you can possibly generate better code
directly. For instance, if your array has only 3 elements, <3 x float>,
the vectorizer could decide it's not worth changing it. But if you
generate vector types all the way through, the cost of using the vector
engines is reduced, and it may be worthwhile even if the vectorizer
thinks otherwise. As usual, this is not always true, as sometimes the
vectorizer sees patterns you don't, or can add run-time checks to do
selective vectorization, and so on.
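
For instance, taking the <3 x float> case above, generating the vector
type directly gives you a single IR instruction that legalizes
reasonably well, whereas a vectorizer looking at the equivalent
3-iteration scalar loop might well decide it isn't worth touching:

    ; <3 x float> is widened to <4 x float> during legalization and,
    ; on SSE, becomes a single mulps
    %r = fmul <3 x float> %a, %b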

In the long term, I think it's best to expect the compiler to do the
hard work for you, and to teach the compiler to recognise such cases,
rather than to add special cases to your own programs. As of now,
though, you may have to strike a balance.

It'd be interesting to see a comparison of IRs and benchmarks for programs
running with long vectors vs. arrays, and short non-power-of-two vectors
vs. arrays.

cheers,
--renato

    Whether this is something LLVM could do with <50 x float> or should
    always be up to the front-end developer, I don't know. It doesn't
    seem particularly hard to do in the vectorizer, but it also
    probably won't be high on the TODO list for a while.

    I have actually been reading up on the vectorizer. I'm using LLVM
    3.2, so the vectorizer isn't turned on by default. Would it be
    feasible to explicitly *not* use vectors --- switching to
    aggregates instead --- and then rely on the vectorizer to
    autovectorize the code where appropriate?

As a pragmatic approach to developing things, I'd say that it's best to
view LLVM as a compiler that won't change your code in big ways (even
if one or two passes/plugins might). So to rely on the autovectorizer,
you really want to be producing code that is easy for it to recognise
as vectorizable. So while your code generator may not actually be using
vectors, I'd think you'd need to be thinking about vectorizability
throughout it, and to be aware of the way you're expressing things in
LLVM IR. I'd be particularly wary of using arrays with a loop for each
"conceptual vector" operation, since the compiler may well fail to fuse
the loops and hence miss the opportunities. (But then I've been
thinking a lot about loop fusion recently, so there's possibly an idée
fixe there.)
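
To illustrate, here's a rough sketch (the names and the particular
operations are invented) of what "one loop per conceptual vector
operation" looks like: t = a + b followed by r = t * k, as two separate
passes over 50 elements. If the loops aren't fused, t makes a round
trip through memory and the vectorizer sees two small loops rather than
one:

    define void @addmul(float* %a, float* %b, float* %r, float %k) {
    entry:
      %t = alloca [50 x float]
      br label %add.loop

    add.loop:
      %i = phi i64 [ 0, %entry ], [ %i.next, %add.loop ]
      %pa = getelementptr float* %a, i64 %i
      %pb = getelementptr float* %b, i64 %i
      %pt = getelementptr [50 x float]* %t, i64 0, i64 %i
      %va = load float* %pa
      %vb = load float* %pb
      %s = fadd float %va, %vb
      store float %s, float* %pt
      %i.next = add i64 %i, 1
      %add.done = icmp eq i64 %i.next, 50
      br i1 %add.done, label %mul.preheader, label %add.loop

    mul.preheader:
      br label %mul.loop

    mul.loop:
      %j = phi i64 [ 0, %mul.preheader ], [ %j.next, %mul.loop ]
      %pt2 = getelementptr [50 x float]* %t, i64 0, i64 %j
      %pr = getelementptr float* %r, i64 %j
      %vt = load float* %pt2
      %m = fmul float %vt, %k
      store float %m, float* %pr
      %j.next = add i64 %j, 1
      %mul.done = icmp eq i64 %j.next, 50
      br i1 %mul.done, label %exit, label %mul.loop

    exit:
      ret void
    }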

Regards,
Dave