Document to understand vectorized code

Dear all,

Greetings. I have generated a vectorized code from a C source file (attached). Is there any comprehensive tutorial/document to understand generated instructions or the semantics of the vectorized code ?


find_max.c (187 Bytes)

find_max_out.ll (5.36 KB)

Hi Sudakshina,

First, it helps if you can put your code in a godbolt snippet, like this [1]. It helps people in multiple ways (e.g., they don’t have to download files, they can see exactly what cmd arguments you used, they can
tweak the cmd arguments without having LLVM on their machine etc.).

Is there any comprehensive tutorial/document to understand generated instructions or the semantics of the vectorized code ?

This is quite generic, what is more specifically that you want to understand? Do you want to understand what each individual instruction does? Do you maybe understand that but
you don’t know what is the general method to generate, let’s say by hand, vectorized code (or more specifically, branching vectorized code). Or maybe, you want to understand
how LLVM generates this code, i.e., the inner workings of the vectorization passes.



Στις Κυρ, 6 Ιουν 2021 στις 6:17 π.μ., ο/η Sudakshina Dutta via llvm-dev <> έγραψε:

Dear Stefanos,

I want to understand how the generated code works. For example, the code that I generated using -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize has many shufflevector and other instructions. I wanted to have a document (if there exists any) where an example on loop-vectorization is done with some explanation on the workflow of the code.


There’s some notes here but I didn’t look too closely at it. If you compile with -fno-discard-values names or use a debug build, some of the basic blocks will be named similar to the CFG diagram in this section

Dear Craig,

Thanks for your reply. It seems the control flow of the vectorized program is very different from that of the source code. Is there any document describing difference in control flow, the reason for using shufflevector, etc ?


Shufflevector is a powerful instruction that can be used to copy a scalar to every elements of a vector, concatenate smaller vectors into a larger vector, rearrange elements one or two vectors, etc. Without example code I can’t say for sure what the use of shufflevector in your case is for.

I think that you’re interested in a more general tutorial on how to vectorize code. I’m sorry I’m going to plug in my own articles [1] [2], but I’ve just tried to write a short version here and I basically almost rewrote them so there was
no point. For your case, I’d read at least the first half of “Residual Iterations” from Part 1 and most of part 2. The concept of residual iterations and reductions is something you’ll see over and over and it happens
in your example too. Let me explain the use of shufflevector pretty quick and then I’ll explain some specifics in your example. Also, before we move any further, I recommend that we turn off loop unrolling in your
example because it doesn’t add educational value and just clutters the code, so here’s your updated version [3].

— Shufflevector —

If you are not sure whether you understand what the instruction alone does, let me know. You can also play with this LLVM IR [4]. I recommend that you run this code and change the mask in line 24.
The reason it’s used in your example (see [3]) is because the compiler wants to create a vector which has arr[0] (referring to the C++ entity) in all its lanes. In line 4 (see [2]) it’s loaded into %0, in line 17 it inserts
it only to the 0th (first) lane of a vector (and for the time being, we don’t care about the other lanes which is why the first arg to insertelement is poison) and in line 18, it shuffles the value. The shuffle mask is the vector [0, 0, 0, 0]
(in short, zeroinitializer) which means in the 0th lane of the result vector (%minmax.ident.splat) insert the value of the 0th lane of the input vector (%minmax.ident.splatinsert). In the 1st lane of the result
vector, insert the value of the 0th lane of the input vector etc. So, all lanes of the result vector get the 0th lane of the input vector.

— Residual Iterations —

Assuming that you have understood the use of residual iterations, let’s see where it happens in your example. First of all, notice that in entry, there is a check whether n is bigger than 1, because if it’s not, there are no
iterations to do at all and in this case, the result, i.e., max, is arr[0], which is why it’s loaded so early. Check the phi at line 45. Then, if we make it to for.body.preheader, there is a check whether we have at least
4 iterations (the fact that this loop starts from one makes the computations somewhat more complicated but not a lot). Because if we don’t, then there’s no point going into the vectorized version. Then, skipping what the vectorized
code does exactly for now (but check how the iteration count is rounded down to the nearest multiple of 4 in line 15), check that after the vectorized code, the flow moves to middle.block,
which essentially checks whether there are any more iterations to do and if so, jumps to the serial version of the loop (this is where the residual iterations happen).

— Reduction —

In the article, I explain reductions using accumulation as an example, but in your example there’s another reduction, namely, finding the maximum element. Notice that this is also a reduction: new_value = max_of(old_value, data).
The reduction variable (acting as both old and new value) is max in your example. The operation is max_of and the continuously new data is arr[i].

Finding the maximum element is a commutative operation. For instance, to find the maximum element among 4, you can split them in two pairs any way you want, find the maximum of each pair independently
of the other pair (those two maximums are partial results and since their computation is independent of each other, it can be done in parallel), and finally find the maximum among these two maximums.

What happens in the vectorized code is that each lane finds a partial maximum. In line 27 it loads 4 elements and in the next two lines, it “compares” them with the previous 4 elements (%vec.phi, which keeps the 4 partial maximums which are updated
in every iteration). It computes a vector which has 4 maximums, one for each pair of lanes (between the new 4 elements in %wide.load, in %wide.load, and %vec.phi). You might want to run an example in paper to see the effect. Consider that you want to compute
the maximum of 9 numbers, in this order: 2, 10, 7, 3, 1, 8, 9, 11, 4

First, you get 2 (arr[0]) and broadcast it in a vector: <2, 2, 2, 2>. This is the starting value of %vec.phi. Then, you load 4 values: %wide.load = <10, 7, 3, 1>
Then, you do an element-wise comparison of the two and for each pair of lanes, you keep the maximum. In this case, 10 > 2, 7 > 2, 3 > 2 but 1 < 2. So, %5 = <10, 7, 3, 2>, which becomes the new %vec.phi.
Next iteration, you load another 4: %wide.load = <8, 9, 11, 4> and you compare with (the new) %vec.phi: 8 < 10, 9 > 7, 11 > 3, 4 > 2. So, %5 = <10, 9, 11, 4>. Now, you have computed 4 partial maximums
and you only need to find the maximum among those 4 to find the total maximum. That happens in line 35 and you’re done.

Hope this helps! Let me know if there are any questions.



Στις Δευ, 7 Ιουν 2021 στις 8:49 π.μ., ο/η Craig Topper <> έγραψε:

Dear Stefanos,

I shall go through this thoroughly and let you know.

Thanks a lot.

Dear Stefanos,

Sorry for getting back to you so late. Can you help with some document (other than language reference manual) the purpose of using shufflevector ?


Hi Sudakshina,

I’m not sure what you’re looking for. Assuming that you have understood what the instruction does, the purpose of using it in this specific case is what I explained above; broadcasting arr[0] to all the lanes of a vector.
The purpose of using it in general is not a single one, as Crag said (quoting):

Shufflevector is a powerful instruction that can be used to copy a scalar to every elements of a vector, concatenate smaller vectors into a larger vector, rearrange elements one or two vectors, etc.

You can see that your example falls in the first case he described. If you want a document showcasing examples of other cases, then I’m sorry, I don’t have such a thing.

But are you sure you understand what the instruction does (again, not why someone decided to use it, but what it does) ? If not, I can try to explain along with the supporting godbolt I sent previously.


Στις Δευ, 14 Ιουν 2021 στις 4:16 μ.μ., ο/η Sudakshina Dutta <> έγραψε:

Dear Stefanos,

I understood it.

Thanks a lot for your help.