Allow constexpr in vector extensions

Greetings!

This is my first post on this ML so, please, do tell me if I'm doing it
wrong.

I've noticed the following difference between GCC and Clang. Consider
this piece of code

typedef int v4si __attribute__ ((vector_size (16)));

int main()
{
    constexpr v4si a = {1,2,3,4};
    constexpr v4si b = {2,0,0,0};
    v4si c = a + b;
}

It compiles cleanly with both GCC and Clang. However, if I try to make c
constexpr, Clang tells me that operator+ on vectors is not constexpr. I'm
wondering if there's a reason for that. There are no constraints from
the Standard, as these are intrinsics, so I see no reason why we
couldn't allow constexpr code to benefit from SIMD.

Tom

I see no errors with Clang 3.8, 3.9, 4.0 and GCC 5, 6, 7 on my Haswell
Mac. Try setting -std= appropriately, given that constexpr is a C++11
feature.

Jeff

$ cat v4si.cc

typedef int v4si __attribute__ ((vector_size (16)));

int main()
{
    constexpr v4si a = {1,2,3,4};
    constexpr v4si b = {2,0,0,0};
    v4si c = a + b;
}

$ for cxx in /usr/bin/clang++ /usr/local/Cellar/llvm/4.0.1/bin/clang++ /usr/local/Cellar/llvm@3.8/3.8.1/bin/clang++-3.8 /usr/local/Cellar/llvm@3.9/3.9.1_1/bin/clang++ g++-7 g++-6 g++-5 ; do $cxx -std=c++1z v4si.cc && echo $cxx SUCCESS || echo $cxx FAIL ; done
/usr/bin/clang++ SUCCESS
/usr/local/Cellar/llvm/4.0.1/bin/clang++ SUCCESS
/usr/local/Cellar/llvm@3.8/3.8.1/bin/clang++-3.8 SUCCESS
/usr/local/Cellar/llvm@3.9/3.9.1_1/bin/clang++ SUCCESS
g++-7 SUCCESS
g++-6 SUCCESS
g++-5 SUCCESS

$ for cxx in /usr/bin/clang++ /usr/local/Cellar/llvm/4.0.1/bin/clang++ /usr/local/Cellar/llvm@3.8/3.8.1/bin/clang++-3.8 /usr/local/Cellar/llvm@3.9/3.9.1_1/bin/clang++ g++-7 g++-6 g++-5 ; do $cxx -std=c++11 v4si.cc && echo $cxx SUCCESS || echo $cxx FAIL ; done
/usr/bin/clang++ SUCCESS
/usr/local/Cellar/llvm/4.0.1/bin/clang++ SUCCESS
/usr/local/Cellar/llvm@3.8/3.8.1/bin/clang++-3.8 SUCCESS
/usr/local/Cellar/llvm@3.9/3.9.1_1/bin/clang++ SUCCESS
g++-7 SUCCESS
g++-6 SUCCESS
g++-5 SUCCESS

$ for cxx in /usr/bin/clang++ /usr/local/Cellar/llvm/4.0.1/bin/clang++ /usr/local/Cellar/llvm@3.8/3.8.1/bin/clang++-3.8 /usr/local/Cellar/llvm@3.9/3.9.1_1/bin/clang++ g++-7 g++-6 g++-5 ; do $cxx -std=c++14 v4si.cc && echo $cxx SUCCESS || echo $cxx FAIL ; done
/usr/bin/clang++ SUCCESS
/usr/local/Cellar/llvm/4.0.1/bin/clang++ SUCCESS
/usr/local/Cellar/llvm@3.8/3.8.1/bin/clang++-3.8 SUCCESS
/usr/local/Cellar/llvm@3.9/3.9.1_1/bin/clang++ SUCCESS
g++-7 SUCCESS
g++-6 SUCCESS
g++-5 SUCCESS

The original snippet compiles fine. The problem occurs when you try to
make the 'c' variable constexpr:

typedef int v4si __attribute__ ((vector_size (16)));

int main()
{
    constexpr v4si a = {1,2,3,4};
    constexpr v4si b = {2,0,0,0};
    constexpr v4si c = a + b;
}
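(As an aside, there is a constexpr-friendly workaround that should compile
today, since element-wise initialization of a constexpr vector is accepted
even where the vector operator+ is rejected. This is only a sketch, and
the function name is made up:)

```cpp
typedef int v4si __attribute__ ((vector_size (16)));

// add_folded is a made-up name. The addition is performed on scalar
// constant expressions, so the result can still be declared constexpr
// even when operator+ on the vector type itself is not.
constexpr v4si add_folded()
{
    constexpr int a[4] = {1, 2, 3, 4};
    constexpr int b[4] = {2, 0, 0, 0};
    return v4si{a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}
```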

Thank you, Halfdan. This is exactly what I meant.

Tom

We could certainly allow this. However, last time this was discussed, an
objection was raised: materializing arbitrary vector constants is not cheap
on all targets, and in some cases user code is written in such a way as to
describe how a particular constant should be generated (e.g., start with
one easy-to-lower constant, shift by N, add another easy-to-lower constant). If
we constant-fold arbitrary vector operations, that sequence of operations
will be lost in many cases, requiring backend heroics to infer how to
materialize the constant.
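To make that concrete, here is a sketch (the function name and the chosen
constant are made up) of the kind of user code being described, where the
sequence of operations is itself the materialization recipe:

```cpp
typedef int v4si __attribute__ ((vector_size (16)));

// splat_0x10001 is a made-up name. Each intermediate value is cheap to
// lower on its own; folding the whole thing into one literal vector
// constant discards this recipe.
v4si splat_0x10001()
{
    v4si ones = {1, 1, 1, 1};  // an easy-to-lower constant
    v4si hi = ones << 16;      // shift by N: {0x10000, 0x10000, 0x10000, 0x10000}
    return hi + ones;          // add another easy constant: {0x10001, ...}
}
```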

I don't know to what extent the above is actually a valid objection,
though: in my testing, LLVM itself will fold together the operations and in
so doing lose the instructions on how to materialize the constant. (And in
some cases, Clang's IR generation code will do the same, because it does
IR-level constant folding as it goes.)

Example: on x86_64, v4si{-1, -1, -1, -1} + v4si{2, 0, 0, 0} can be emitted
as four instructions (pcmpeqd, mov, movd, paddd) totalling 17 bytes, or as
one movaps (7 bytes) plus a 16 byte immediate; the former is both smaller
and a little faster, but LLVM is only able to produce the latter today.
LLVM is smart enough to produce good code for those two constants in
isolation, but not for v4si{1, -1, -1, -1}.

[snip]

It compiles cleanly with both GCC and Clang. However, if I try to make
c constexpr, Clang tells me that operator+ on vectors is not
constexpr. I'm wondering if there's a reason for that. There are no
constraints from the Standard, as these are intrinsics, so I see no
reason why we couldn't allow constexpr code to benefit from SIMD.

We could certainly allow this. However, last time this was discussed,
an objection was raised: materializing arbitrary vector constants is
not cheap on all targets, and in some cases user code is written in
such a way as to describe how a particular constant should be generated
(eg, start with one easy-to-lower constant, shift by N, add another
easy-to-lower constant). If we constant-fold arbitrary vector
operations, that sequence of operations will be lost in many cases,
requiring backend heroics to infer how to materialize the constant.

I don't know to what extent the above is actually a valid objection,
though: in my testing, LLVM itself will fold together the operations
and in so doing lose the instructions on how to materialize the
constant. (And in some cases, Clang's IR generation code will do the
same, because it does IR-level constant folding as it goes.)

I guess it's a stupid question and I'm sorry for that, but I'm very new
to all this, so could you maybe explain a bit what you mean by
"materializing vector constants"? Does this just mean creating a vector
constant in memory? If it does, then my first guess would be to use
inline asm since we're talking about some specific target where user
"knows better".

Anyway, could you maybe point me to an example of user code specifying
the materialisation process that I could play around with?

Example: on x86_64, v4si{-1, -1, -1, -1} + v4si{2, 0, 0, 0} can be
emitted as four instructions (pcmpeqd, mov, movd, paddd) totalling 17
bytes, or as one movaps (7 bytes) plus a 16 byte immediate; the former
is both smaller and a little faster, but LLVM is only able to produce
the latter today. LLVM is smart enough to produce good code for those
two constants in isolation, but not for v4si{1, -1, -1, -1}.

I don't quite get it. Any chance you could provide a small piece of code
illustrating your point?

Tom


>
>> [snip]
>>
>> It compiles cleanly with both GCC and Clang. However, if I try to make
>> c constexpr, Clang tells me that operator+ on vectors is not
>> constexpr. I'm wondering if there's a reason for that. There are no
>> constraints from the Standard, as these are intrinsics, so I see no
>> reason why we couldn't allow constexpr code to benefit from SIMD.
>
>
> We could certainly allow this. However, last time this was discussed,
> an objection was raised: materializing arbitrary vector constants is
> not cheap on all targets, and in some cases user code is written in
> such a way as to describe how a particular constant should be generated
> (eg, start with one easy-to-lower constant, shift by N, add another
> easy-to-lower constant). If we constant-fold arbitrary vector
> operations, that sequence of operations will be lost in many cases,
> requiring backend heroics to infer how to materialize the constant.
>
> I don't know to what extent the above is actually a valid objection,
> though: in my testing, LLVM itself will fold together the operations
> and in so doing lose the instructions on how to materialize the
> constant. (And in some cases, Clang's IR generation code will do the
> same, because it does IR-level constant folding as it goes.)

I guess it's a stupid question and I'm sorry for that, but I'm very new
to all this, so could you maybe explain a bit what you mean by
"materializing vector constants"? Does this just mean creating a vector
constant in memory?

No, it means creating a vector constant in a vector register (ideally
without loading it from somewhere in memory, since that tends to be slow
and have a large code size).

If it does, then my first guess would be to use inline asm since we're
talking about some specific target where user "knows better".

I generally agree that the user should have to explicitly express that they
want their operation sequence to be preserved, either via inline asm or
some other mechanism we provide them.

Anyway, could you maybe point me to an example of user code specifying
the materialisation process that I could play around with?

My observation was that such user code does not actually exist / work,
because the vector operations get folded together at the IR level. That is:
the objection to constant evaluation of vector operations in the frontend
does not appear to be a valid objection (perhaps it once was, before the
middle-end optimizers started optimizing vector operations, but not any
more).

> Example: on x86_64, v4si{-1, -1, -1, -1} + v4si{2, 0, 0, 0} can be
> emitted as four instructions (pcmpeqd, mov, movd, paddd) totalling 17
> bytes, or as one movaps (7 bytes) plus a 16 byte immediate; the former
> is both smaller and a little faster, but LLVM is only able to produce
> the latter today. LLVM is smart enough to produce good code for those
> two constants in isolation, but not for v4si{1, -1, -1, -1}.

I don't quite get it. Any chance you could provide a small piece of code
illustrating your point?

Sure:

v4si f() {
    return v4si{-1,-1,-1,-1} + v4si{2,0,0,0};
}

v4si g() {
  v4si result;
  asm(R"(pcmpeqd %0, %0
        movl $2, %%eax
        movd %%eax, %%xmm1
        paddd %%xmm1, %0)" : "=x"(result) : : "eax", "xmm1");
  return result;
}

LLVM will materialize v4si{-1,-1,-1,-1} as pcmpeqd, and it will materialize
{2,0,0,0} as movl + movd. But the code it produces for f() is larger and
slower than the code for g() (which is the naive combination of what it did
for the two constants in isolation), because the vector operations got
folded together.

Anyway, could you maybe point me to an example to play around of user
code specifying the materialisation process?

My observation was that such user code does not actually exist / work,
because the vector operations get folded together at the IR level. That
is: the objection to constant evaluation of vector operations in the
frontend does not appear to be a valid objection (perhaps it once was,
before the middle-end optimizers started optimizing vector operations,
but not any more).

OK, so essentially that's a go on trying to implement it, right? I'll
probably take some time before I come up with a PR as I'm completely
unfamiliar with the code base.

> Example: on x86_64, v4si{-1, -1, -1, -1} + v4si{2, 0, 0, 0} can be
> emitted as four instructions (pcmpeqd, mov, movd, paddd) totalling
> 17 bytes, or as one movaps (7 bytes) plus a 16 byte immediate; the
> former is both smaller and a little faster, but LLVM is only able to
> produce the latter today. LLVM is smart enough to produce good code
> for those two constants in isolation, but not for v4si{1, -1, -1,
> -1}.

I don't quite get it. Any chance you could provide a small piece of
code illustrating your point?

Sure:

v4si f() {
    return v4si{-1,-1,-1,-1} + v4si{2,0,0,0};
}

v4si g() {
  v4si result;
  asm(R"(pcmpeqd %0, %0
        movl $2, %%eax
        movd %%eax, %%xmm1
        paddd %%xmm1, %0)" : "=x"(result) : : "eax", "xmm1");
  return result;
}

LLVM will materialize v4si{-1,-1,-1,-1} as pcmpeqd, and it will
materialize {2,0,0,0} as movl + movd. But the code it produces for f()
is larger and slower than the code for g() (which is the naive
combination of what it did for the two constants in isolation), because
the vector operations got folded together.

Aha, thanks, I get it now. It's interesting though that f() gets
implemented in a single movaps instruction: Compiler Explorer

Tom

>> Anyway, could you maybe point me to an example to play around of user
>> code specifying the materialisation process?
>
> My observation was that such user code does not actually exist / work,
> because the vector operations get folded together at the IR level. That
> is: the objection to constant evaluation of vector operations in the
> frontend does not appear to be a valid objection (perhaps it once was,
> before the middle-end optimizers started optimizing vector operations,
> but not any more).

OK, so essentially that's a go on trying to implement it, right? I'll
probably take some time before I come up with a PR as I'm completely
unfamiliar with the code base.

Yes, please go for it :)

I agree, and I would only add that the original argument never seemed to
justify (1) rejecting it in a constexpr context and (2) not doing
constant-initialization of globals, especially static locals. As
inefficient as loading an arbitrary vector constant from memory might
be, performing dynamic lazy initialization is surely worse.
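A minimal sketch of that static-local case (the function name is
illustrative): if the '+' below cannot be constant-folded by the
frontend, 't' gets guarded, lazy dynamic initialization instead of being
emitted directly into .rodata.

```cpp
typedef int v4si __attribute__ ((vector_size (16)));

// offset_table is a made-up name. With constant folding, 't' is a
// constant-initialized object; without it, every call may pay for a
// thread-safe initialization check.
const v4si& offset_table()
{
    static const v4si t = v4si{1, 2, 3, 4} + v4si{2, 0, 0, 0};
    return t;
}
```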

John.