generate vectorized code

My question is:
How do I make clang generate assembly with vector instructions for my target?

The back story is:
I’ve added a few vector instructions to my target and confirmed that they are used by running the test below through the following command:

opt i.esencia.ll -S -march=esencia -mcpu=esencia -loop-vectorize | llc -mcpu=esencia -o i.esencia.s

target datalayout = "E-m:e-p:32:32-i64:32-f64:32-v64:32-v128:32-a:0:32-n32"
target triple = "esencia"

; Function Attrs: nounwind uwtable
define i32 @main() {
entry:
  %z = alloca <4 x i32>
  %a = alloca <4 x i32>
  %b = alloca <4 x i32>
  %a.l = load <4 x i32>* %a
  %b.l = load <4 x i32>* %b
  %z.l = add <4 x i32> %a.l, %b.l
  store <4 x i32> %z.l, <4 x i32>* %z
  ret i32 0
}

Now I’m trying to run clang and vectorize the following test:

#define N 16

int main () {

  int a[N], b[N];
  int c[N];

  for (int i = 0; i < N; ++i)
    c[i] = a[i] + b[i];

  int sum = 0;
  for (int i = 0; i < N; ++i)
    sum += c[i];

  return sum;
}

Here are the command lines I tried:

clang -S test.c --target=esencia -fvectorize -o test.esencia.s

clang -S test.c --target=esencia -fvectorize -fslp-vectorize-aggressive -o test.esencia.s -fslp-vectorize

clang -S test.c --target=esencia -fvectorize -fslp-vectorize-aggressive -o test.esencia.s -fslp-vectorize -fno-lax-vector-conversions

Unfortunately nothing worked. Can someone help me out? I can’t really figure out why this is not working.

Any help is appreciated.

Hi Rail,

Two hints to begin with:

  1. Make sure your example is vectorized on another target first, X86 for example.
  2. Is your target correctly overriding the TTI (declaring the vector register size, for example) so that the vectorizer can kick in? See X86TTIImpl::getRegisterBitWidth for instance. Alternatively you can test the SLP vectorizer by passing to clang: -mllvm -slp-max-reg-size -mllvm 512 (I don’t see an equivalent option for the loop vectorizer though).

Well, it sort of worked. I added a getRegisterBitWidth(...), but then I got
this error:

fatal error: error in backend: Cannot select: 0x5e949a8: v4i32 = BUILD_VECTOR 0x5e91ae8, 0x5e91ae8, 0x5e91ae8, 0x5e91ae8 [ORD=16] [ID=16]
  0x5e91ae8: i32 = Constant<0> [ID=5]
  0x5e91ae8: i32 = Constant<0> [ID=5]
  0x5e91ae8: i32 = Constant<0> [ID=5]
  0x5e91ae8: i32 = Constant<0> [ID=5]

What am I missing?

Any help is appreciated.

It means that vectorized IR reached your backend, but your backend is not ready to deal with all the vector constructs in SelectionDAG.
You need to express how to legalize/select the BUILD_VECTOR in SelectionDAG to instructions that your target supports. You can look at what other targets are doing.
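
For a target that only has vector load, store, and add, one plausible way to handle a BUILD_VECTOR of scalars is to go through memory: store the scalars into a stack slot, then load the slot back as one vector. The C sketch below only illustrates that idea at the source level. It is not LLVM backend code, and the names v4si and build_through_stack are hypothetical. It assumes a compiler with the GCC/clang vector_size extension.

```c
#include <string.h>

/* Four 32-bit ints packed into a 16-byte vector (GCC/clang extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Conceptual equivalent of building a <4 x i32> through memory:
 * four scalar stores into a stack slot, then one vector-sized load. */
v4si build_through_stack(int e0, int e1, int e2, int e3) {
    int slot[4];                 /* the stack slot */
    slot[0] = e0;                /* scalar stores, one per element */
    slot[1] = e1;
    slot[2] = e2;
    slot[3] = e3;

    v4si v;
    memcpy(&v, slot, sizeof v);  /* stands in for the single vector load */
    return v;
}
```

Whether a backend does this in a custom lowering hook or relies on the generic legalizer is a separate design decision. The sketch only shows why vector load and store can be enough to materialize a vector from scalars.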

Thanks for the reply. Do you mind pointing out the files I need to look at?
I looked at X86SelectionDAGInfo.cpp as well as ARMSelectionDAGInfo.cpp but
couldn't find anything relevant.

I think I understand that I need to implement a LowerBUILD_VECTOR, but
I'm struggling to understand how to do it. I did look at other targets, and
I'm not very clear on what they are doing, as I'm not very experienced with
LLVM or with practical compilers (I did take a class in college, but as
I'm finding out now, there is a giant difference between theory and
practice). At the moment my target has 3 very simple instructions: vector
add, vector load, and vector store. All vector elements are 32-bit
integers. Can someone at least point me in the right direction on how to
start implementing LowerBUILD_VECTOR?

Any help is appreciated.

So I've added setOperationAction(ISD::BUILD_VECTOR, MVT::v4i32,
Expand); to my code, but that generated the following error:

fatal error: error in backend: Cannot select: 0x6a84dc8: i32 = extract_vector_elt 0x6a85388, 0x6a813b0 [ORD=9] [ID=16]
  0x6a85388: v4i32 = add 0x6a81098, 0x6a81e00 [ORD=8] [ID=15]
    0x6a81098: v4i32 = add 0x6a81bf0, 0x6a84168 [ORD=6] [ID=12]
      0x6a81bf0: v4i32,ch = CopyFromReg 0x6a2b7f0, 0x6a819e0 [ORD=5] [ID=8]
        0x6a819e0: v4i32 = Register %vreg4 [ID=1]
      0x6a84168: v4i32 = vector_shuffle 0x6a81bf0, 0x6a857a8<2,3,u,u> [ORD=5] [ID=10]
        0x6a81bf0: v4i32,ch = CopyFromReg 0x6a2b7f0, 0x6a819e0 [ORD=5] [ID=8]
          0x6a819e0: v4i32 = Register %vreg4 [ID=1]
        0x6a857a8: v4i32 = undef [ID=2]
    0x6a81e00: v4i32 = vector_shuffle 0x6a81098, 0x6a857a8<1,u,u,u> [ORD=7] [ID=14]
      0x6a81098: v4i32 = add 0x6a81bf0, 0x6a84168 [ORD=6] [ID=12]
        0x6a81bf0: v4i32,ch = CopyFromReg 0x6a2b7f0, 0x6a819e0 [ORD=5] [ID=8]
          0x6a819e0: v4i32 = Register %vreg4 [ID=1]
        0x6a84168: v4i32 = vector_shuffle 0x6a81bf0, 0x6a857a8<2,3,u,u> [ORD=5] [ID=10]
          0x6a81bf0: v4i32,ch = CopyFromReg 0x6a2b7f0, 0x6a819e0 [ORD=5] [ID=8]
            0x6a819e0: v4i32 = Register %vreg4 [ID=1]
          0x6a857a8: v4i32 = undef [ID=2]
      0x6a857a8: v4i32 = undef [ID=2]
  0x6a813b0: i32 = Constant<0> [ID=3]
In function: main

Then I added

setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::i32, Expand);

but I still got the same error. So I removed that line and added

setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v4i32, Expand);

which produced the following error:

fatal error: error in backend: Cannot select: 0x7389250: v4i32 = vector_shuffle 0x73884e8, 0x738cbf8<1,u,u,u> [ORD=7] [ID=15]
  0x73884e8: v4i32 = add 0x7389040, 0x738b5b8 [ORD=6] [ID=13]
    0x7389040: v4i32,ch = CopyFromReg 0x73327f0, 0x7388e30 [ORD=5] [ID=9]
      0x7388e30: v4i32 = Register %vreg4 [ID=1]
    0x738b5b8: v4i32 = vector_shuffle 0x7389040, 0x738cbf8<2,3,u,u> [ORD=5] [ID=11]
      0x7389040: v4i32,ch = CopyFromReg 0x73327f0, 0x7388e30 [ORD=5] [ID=9]
        0x7388e30: v4i32 = Register %vreg4 [ID=1]
      0x738cbf8: v4i32 = undef [ID=2]
  0x738cbf8: v4i32 = undef [ID=2]
In function: main

Then I added setOperationAction(ISD::VECTOR_SHUFFLE, MVT::v4i32,
Expand);

and then clang just hung. There is no error and no warning; clang just sits
there and nothing happens.

I'm doing a lot of guesswork in trying to understand what is going on. I
would really appreciate any help on this.

I think you created a cycle; this is easy to do with SelectionDAG :)
Basically SelectionDAG will iterate until it does not see anything left to change. So if you insert a transformation on a pattern A that generates pattern B, while another transformation matches B and somehow generates A again, you run into an infinite loop.

Here is how I started with SelectionDAG:

  • small IR (bugpoint can help)
  • the magic flag: -debug
  • read the output of SelectionDAG debugging (especially with cycles)
  • matching the log to source code
  • single stepping in a debugger sometimes.

Also: try to run your experiments with llc so you can easily tweak the input IR to SelectionDAG.

--
Mehdi
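
The A-to-B-to-A situation described above can be illustrated with a toy rewrite loop. This is only a sketch of the general failure mode, not of SelectionDAG itself: the node kinds, the three rules, and run_to_fixpoint are all hypothetical, and the pass limit exists only so that the toy terminates.

```c
#include <stddef.h>

/* Toy stand-ins for DAG patterns; these are not real LLVM opcodes. */
enum node_kind { PATTERN_A, PATTERN_B, PATTERN_DONE };

typedef enum node_kind (*rule_fn)(enum node_kind);

/* Hypothetical rule: rewrites A into B. */
enum node_kind rule_a_to_b(enum node_kind k) {
    return k == PATTERN_A ? PATTERN_B : k;
}

/* Hypothetical rule: rewrites B back into A, which closes the cycle. */
enum node_kind rule_b_to_a(enum node_kind k) {
    return k == PATTERN_B ? PATTERN_A : k;
}

/* Hypothetical rule: rewrites B into a final form that nothing matches. */
enum node_kind rule_b_to_done(enum node_kind k) {
    return k == PATTERN_B ? PATTERN_DONE : k;
}

/* Apply every rule repeatedly until a full pass changes nothing.
 * Returns the number of passes that made a change, or -1 if the node
 * is still changing after max_passes (the rules feed each other). */
int run_to_fixpoint(enum node_kind *node, const rule_fn *rules,
                    size_t n_rules, int max_passes) {
    for (int pass = 0; pass < max_passes; ++pass) {
        int changed = 0;
        for (size_t i = 0; i < n_rules; ++i) {
            enum node_kind next = rules[i](*node);
            if (next != *node) {
                *node = next;
                changed = 1;
            }
        }
        if (!changed)
            return pass;
    }
    return -1;
}
```

Running it with the rule set {rule_a_to_b, rule_b_to_a} never settles, while {rule_a_to_b, rule_b_to_done} reaches a fixpoint after one changing pass. In a debug log, the cyclic case shows up as the same pair of rewrites repeating.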

I ran a very simple test using llc and the following .ll file:
target datalayout = "E-m:e-p:32:32-i64:32-f64:32-v64:32-v128:32-a:0:32-n32"
target triple = "esencia"

; Function Attrs: nounwind uwtable
define i32 @main() {
entry:
   %z = alloca <4 x i32>
   %a = alloca <4 x i32>
   %b = alloca <4 x i32>
   %a.l = load <4 x i32>* %a
   %b.l = load <4 x i32>* %b
   %z.l = add <4 x i32> %a.l, %b.l
   store <4 x i32> %z.l, <4 x i32>* %z
   ret i32 0
}

The test ran successfully (by successfully I mean it generated correct
assembly for my target) without any modifications to the code, i.e. I
didn't have to add any of
  setOperationAction(ISD::BUILD_VECTOR, MVT::v4i32, Expand);
  setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v4i32, Expand);
  setOperationAction(ISD::VECTOR_SHUFFLE, MVT::v4i32, Expand);

In other words I left the code as is.

However if I use a .c code and run it through clang, I don't see any vector
instructions. I'm puzzled. What am I doing wrong? There seems to be a step
missing, the one that will generate vectorized IR, but I can't seem to find
how to do it.

Any help on this is really appreciated.

Yes, this IR does not build or shuffle any vector. Try to write a function that takes 8 ints and a pointer to a <4 x i32>, builds two vectors from the 8 ints, sums them, and stores the result to the pointer.

Try: clang -O3 -emit-llvm -S test.c
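
Two things may be worth ruling out on the clang side. The command lines earlier in the thread pass no -O level, and the vectorizers only run as part of the optimization pipeline, so -fvectorize alone has no effect at -O0. The test also reads uninitialized local arrays, which leaves the optimizer free to fold the computation away. Below is a minimal sketch of a test that avoids both issues by taking its data through pointer parameters. The function name add_and_sum and the restrict qualifiers are illustrative choices, not something from the thread.

```c
#define N 32

/* Inputs arrive through parameters, so the compiler cannot treat them as
 * undefined values. restrict tells it that c does not alias a or b. */
int add_and_sum(const int *restrict a, const int *restrict b,
                int *restrict c) {
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];   /* element-wise add: loop-vectorizer candidate */

    int sum = 0;
    for (int i = 0; i < N; ++i)
        sum += c[i];          /* reduction over the result */

    return sum;
}
```

Compiling this with -O3 -emit-llvm -S and searching the output for <4 x i32> is a quick way to check whether the vectorizer fired before looking at the backend.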

This might sound like a dumb question, but how does one build a vector of
ints out of regular ints in IR?

See: http://llvm.org/docs/LangRef.html#vector-operations

In short, the IR has “insertelement”, which maps to “INSERT_VECTOR_ELT” in SDAG and “extractelement”, which maps to “EXTRACT_VECTOR_ELT” in SDAG.

I usually find good examples by grepping in the lit tests. Another way is to write the function in C, compile it with clang using -O3 -emit-llvm -S, and use the output as a starting point.
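
As a sketch of the function described earlier (eight ints in, one vector out), the following uses the GCC/clang vector_size extension so that the vector construction is explicit in the source. The names v4si and build_and_add are hypothetical. With clang, the brace initializers typically show up in the IR as a chain of insertelement instructions, though the exact output depends on the clang version and optimization level.

```c
/* Four 32-bit ints packed into a 16-byte vector (GCC/clang extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Build two vectors from eight scalars, add them, store through res. */
void build_and_add(int a1, int a2, int a3, int a4,
                   int b1, int b2, int b3, int b4, v4si *res) {
    v4si a = { a1, a2, a3, a4 };   /* scalars gathered into a vector */
    v4si b = { b1, b2, b3, b4 };
    *res = a + b;                  /* one element-wise vector add */
}
```

Feeding the resulting .ll file to llc exercises the vector-building path in the backend without depending on what the vectorizer's cost model decides.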

I tried using clang test.c -O3 -emit-llvm -S, but I didn't see any
insertelement or extractelement instructions in the output. How does one
trigger vector operations?

Below is the test.c file. It seemed to me like a good candidate for
vectorization, however nothing happened. I would really appreciate it if you
could point me in the right direction with respect to vector generation.

Any help is appreciated.

Forgot to attach a C file. Here it is:

#define N 32

int main () {

  int a[N], b[N];
  int c[N];

  for (int i = 0; i < N; ++i)
       c[i] = a[i] + b[i];

  int sum=0;
  for (int i = 0; i < N; ++i)
       sum += c[i];

  return sum;
}

I see vectorization happening on this example (see below).

This will be vectorized without any insertelement. Here are a few lines extracted from the output of clang on this code:

%wide.load8.6 = load <4 x i32>* %48, align 16, !tbaa !2
%49 = add nsw <4 x i32> %wide.load8.6, %wide.load.6
%50 = getelementptr inbounds [32 x i32]* %c, i64 0, i64 24
%51 = bitcast i32* %50 to <4 x i32>*
store <4 x i32> %49, <4 x i32>* %51, align 16, !tbaa !2

You didn’t write the example as I described it, i.e. taking integers, doing some arithmetic on them, and writing the results to contiguous memory. With your version, the vectorizer is able to load vectors directly from memory, operate on them, and store the results, so it never has to build a vector. Try the following C code instead:

void foo (int a1, int a2, int a3, int a4, int b1, int b2, int b3, int b4, int *res) {
  res[0] = a1 + b1 * 2;
  res[1] = a2 + b2 * 2;
  res[2] = a3 + b3 * 2;
  res[3] = a4 + b4 * 2;
}

That’s for the clang part. You can also look at the vectorizer lit tests for examples of IR before and after vectorization.

Hmm... I don't see that vectorized output on my side. Maybe that's because
I'm running an older version of clang, 3.5 to be exact. For now I'm stuck
with it and can't switch to a newer version.

I misunderstood you. I thought you asked me to create IR with insertelement
in it. I'm going to try your example and see what happens.

--
Rail Shafigulin
Software Engineer
Esencia Technologies

Just out of curiosity how did you know that your foo code will produce
vectorized code? I tried code similar to yours without any multiplication
and no vectors were generated.

I read the source code for the SLP Vectorizer ;)
(other than looking at unit tests, this is another good way of learning how LLVM works)

It is a matter of cost model: there need to be a few arithmetic instructions to balance the cost of building a vector.

Here is how I started with SelectionDAG:

- small IR (bugpoint can help)

Did you mean a breakpoint?

- the magic flag: -debug

- read the output of SelectionDAG debugging (especially with cycles)
- matching the log to source code

What log are you talking about?