Vectorization of pointer PHI nodes

This is almost ideal for SLP vectorization, except for two problems:

  1. We have 4 stores to consecutive locations, but the last element is the constant zero, and not an additional SUB. At the moment we don’t have support for idempotence operations, but this is something that we should add.

  2. The values that we are subtracting come from 3 loads. We usually load 4 elements from memory, or scalarize the inputs (we don’t support masked loads on AVX512).

Do you know if the GCC SLP Vectorizer vectorizes this, or is it their Loop Vectorizer?

Thanks,
Nadav

1. We have 4 stores to consecutive locations, but the last element is the
constant zero, and not an additional SUB. At the moment we don’t have
support for idempotence operations, but this is something that we should
add.

The fourth write is not necessary for GCC to vectorize it (nor was in the
original code), but it was a result of CReduce's attempt to converge when
running ARM's GCC and inspecting the right sequence of vector instructions.
(btw, CReduce is great!).

In this case, shouldn't the vector operations just add an undef to the
fourth lane? Would back-ends recognize it as an AVX/NEON/AltiVec
instruction, or just try to re-linearise?

2. The values that we are subtracting come from 3 loads. We usually load 4
elements from memory, or scalarize the inputs (we don’t support masked
loads on AVX512).

That is a more complicated issue, but we can get away with it if we, in a
first implementation, only allow the same number of reads and writes on
each loop. In that case, if the operations on the independent variables are
identical, then it means the loop can be simplified by multiplying the
induction range by N and reducing the number of load/sub/store lanes to
one, in which case, loop vectorization becomes trivial.
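A minimal sketch of that simplification, assuming N = 3 and that every lane applies the same subtraction. The function names and the `delta` parameter are hypothetical, chosen only for illustration; they are not taken from the attached test case.

```c
#include <stddef.h>

/* Original shape: each iteration handles N = 3 consecutive elements
   and applies the same operation to every lane. */
void sub_three_lanes(const unsigned char *read, unsigned char *write,
                     unsigned char delta, size_t iterations) {
  for (size_t i = 0; i < iterations; i++) {
    write[3 * i]     = (unsigned char)(read[3 * i]     - delta);
    write[3 * i + 1] = (unsigned char)(read[3 * i + 1] - delta);
    write[3 * i + 2] = (unsigned char)(read[3 * i + 2] - delta);
  }
}

/* Collapsed shape: the induction range is multiplied by N and only one
   load/sub/store lane remains, which is a plain unit-stride loop. */
void sub_one_lane(const unsigned char *read, unsigned char *write,
                  unsigned char delta, size_t iterations) {
  for (size_t i = 0; i < 3 * iterations; i++)
    write[i] = (unsigned char)(read[i] - delta);
}
```

Both functions touch the same elements in the same order, so the second form is only valid when the per-lane operations really are identical.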

Do you know if the GCC SLP Vectorizer vectorizes this, or is it their Loop
Vectorizer?

Good question. Which vectorizer does "-ftree-vectorize" turn on?
Because if I use "-fno-tree-vectorize", the code remains scalar.

cheers,
--renato

Renato, can you post the c code for the function and the assembly that gcc produces?

Your initial example could be handled well by vectorization of strided loops (and the mention of VLD3(.8?)/VST3(.8?) led me to assume that this is what happened). But the LLVM-IR you sent has a store of 0 in there ;) and strides by 4.

Thanks,
Arnold

Vectorization of strided loops:

I am using float as the example; otherwise it would get too long.

void f(float * restrict read, float * restrict write) {
  for (int i = 0; i < 256; i++) {
    float a1 = *read++ * 3.0;
    float a2 = *read++ * 4.0;
    float a3 = *read++ * 5.0;

    *write++ = a1;
    *write++ = a2;
    *write++ = a3;
  }
}

recognized as

  /* 256 iterations of 3 elements each: i covers 3 * 256 elements */
  for (int i = 0; i < 3 * 256; i += 3) {
    float a1 = read[i] * 3.0;
    float a2 = read[i+1] * 4.0;
    float a3 = read[i+2] * 5.0;

    write[i] = a1;
    write[i+1] = a2;
    write[i+2] = a3;
  }

=> loop vectorize with a factor of 4, recognizing that after we vector-unroll the loop by four, the scattered accesses from different lines (read[i]..read[i+9+2]) are consecutive and we can efficiently vectorize these accesses (3 vector loads plus interleaves, which on ARM we can do with VLD3.8):

  for (int i = 0; i < 3 * 256; i += 12) {
    float a1 = read[i] * 3.0;
    float a1_2 = read[i+3] * 3.0;
    float a1_3 = read[i+6] * 3.0;
    float a1_4 = read[i+9] * 3.0;

    float a2 = read[i+1] * 4.0;
    float a2_2 = read[i+3+1] * 4.0;
    …

    float a3 = read[i+2] * 5.0;
    float a3_2 = read[i+3+2] * 5.0;

    write[i] = a1;
    write[i+3] = a1_2;
    …

    write[i+1] = a2;
    write[i+1+3] = a2_2;
    ...
  }

VLD3.f32 {a1..a1_4, a2..a2_4, a3..a3_4} [read+i]
a1..a1_4 = VMUL a1..a1_4, #3.0
a2..a2_4 = VMUL a2..a2_4, #4.0
a3..a3_4 = VMUL a3..a3_4, #5.0
VST3.f32 {a1..a1_4, a2..a2_4, a3..a3_4} [write+i]
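The load/multiply/store sequence above can be sketched in portable C. This is only an emulation under stated assumptions: the inner loops stand in for what VLD3/VMUL/VST3 would each do in a single instruction, the vectorization factor is fixed at 4, and the function names and the scalar epilogue are illustrative additions, not part of the original example.

```c
#include <stddef.h>

enum { LANES = 4 }; /* vectorization factor assumed in the text */

/* Scalar reference: three differently scaled elements per iteration. */
void scale_scalar(const float *restrict read, float *restrict write,
                  size_t iterations) {
  for (size_t i = 0; i < iterations; i++) {
    write[3 * i]     = read[3 * i]     * 3.0f;
    write[3 * i + 1] = read[3 * i + 1] * 4.0f;
    write[3 * i + 2] = read[3 * i + 2] * 5.0f;
  }
}

/* Emulated strided vectorization: de-interleave 12 consecutive floats
   into three 4-lane groups, scale each group, then interleave back. */
void scale_deinterleaved(const float *restrict read, float *restrict write,
                         size_t iterations) {
  size_t i = 0;
  for (; i + LANES <= iterations; i += LANES) {
    float a1[LANES], a2[LANES], a3[LANES];
    for (int l = 0; l < LANES; l++) { /* plays the role of VLD3 */
      a1[l] = read[3 * (i + l)];
      a2[l] = read[3 * (i + l) + 1];
      a3[l] = read[3 * (i + l) + 2];
    }
    for (int l = 0; l < LANES; l++) { /* three VMULs, one per group */
      a1[l] *= 3.0f;
      a2[l] *= 4.0f;
      a3[l] *= 5.0f;
    }
    for (int l = 0; l < LANES; l++) { /* plays the role of VST3 */
      write[3 * (i + l)]     = a1[l];
      write[3 * (i + l) + 1] = a2[l];
      write[3 * (i + l) + 2] = a3[l];
    }
  }
  /* Scalar epilogue for iteration counts not divisible by LANES. */
  for (; i < iterations; i++) {
    write[3 * i]     = read[3 * i]     * 3.0f;
    write[3 * i + 1] = read[3 * i + 1] * 4.0f;
    write[3 * i + 2] = read[3 * i + 2] * 5.0f;
  }
}
```

The point of the de-interleaving step is that each group ends up with a single uniform multiplier, so the arithmetic needs no per-lane constants.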

Renato, can you post the c code for the function and the assembly that gcc
produces?

Attached.

Your initial example could be well handled by vectorization of strided
loops (and the mentioning of VLD3(.8?)/VST3(.8?) lead me to assume that
this is what happened). But the LLVM-IR you sent has a store of 0 in there
;) and strides by 4.

I think so. Ignore the last write, it was bogus. (but don't ignore the fact
that GCC vectorized it anyway with vst4!).

By running GCC with -ftree-vectorizer-verbose=1 I got:

test.c:11: note: create runtime check for data references DELTA and
*WRITE_30
test.c:11: note: create runtime check for data references *READ_29 and
*WRITE_30
test.c:11: note: created 2 versioning for alias checks.
test.c:11: note: === vect_do_peeling_for_loop_bound ===Setting upper bound
of nb iterations for epilogue loop to 14
test.c:11: note: LOOP VECTORIZED.

The result is very concise and dense code:

vld1.8 {d28, d29}, [r5]
vld3.8 {d16, d18, d20}, [r9]!
vld3.8 {d17, d19, d21}, [r9]
vmvn q3, q8
vmvn q15, q9
vmvn q8, q10
vsub.i8 q11, q3, q14
vsub.i8 q12, q15, q14
vsub.i8 q13, q8, q14
vst3.8 {d22, d24, d26}, [r8]!
vst3.8 {d23, d25, d27}, [r8]

cheers,
--renato

test.c (398 Bytes)

Hi Renato,

As far as I know, -ftree-vectorize enables both loop vectorization and SLP vectorization. -ftree-slp-vectorize enables SLP vectorization on its own, but it is turned on automatically by -ftree-vectorize.

-Yi

Yes, that looks like it is doing strided access loop vectorization (see: "Auto-vectorization of interleaved data for SIMD", http://dl.acm.org/citation.cfm?id=1133997).