LLVM loop vectorizer

Hello.
     I am trying to vectorize a CSR SpMV (sparse matrix-vector multiplication) procedure, but the LLVM loop vectorizer is not able to handle such code.
     I am using clang and LLVM version 3.4 (on Ubuntu 12.10). I use the -fvectorize option with clang and -loop-vectorize with opt-3.4.
     The CSR SpMV function is inspired by the Stack Overflow question "c - slow sparse matrix vector product (CSR) using open mp" (I can provide the exact code samples used).

     Basically, the problem is that the loop vectorizer does NOT work with an if inside the loop (be it 2 nested loops, or a modification of SpMV I did with just 1 loop - I can provide the exact code) changing the value of the accumulator z. I can sort of understand why LLVM isn't able to vectorize the code.
     However, at http://llvm.org/docs/Vectorizers.html#if-conversion it is written:
             <<The Loop Vectorizer is able to "flatten" the IF statement in the code and generate a single stream of instructions.
               The Loop Vectorizer supports any control flow in the innermost loop.
               The innermost loop may contain complex nesting of IFs, ELSEs and even GOTOs.>>
      Could you please tell me what these lines are trying to say, exactly?

     Could you please tell me what algorithm the LLVM loop vectorizer uses (maybe the algorithm is described in a paper)? So far I have found only 2 presentations on this topic: http://llvm.org/devmtg/2013-11/slides/Rotem-Vectorization.pdf and https://archive.fosdem.org/2014/schedule/event/llvmautovec/attachments/audio/321/export/events/attachments/llvmautovec/audio/321/AutoVectorizationLLVM.pdf .

   Thank you very much,
       Alex

Hi Alex,

Example from the link you provided looks like this:

for (i = 0; i < M; i++) {
    z[i] = 0;
    for (ckey = row_ptr[i]; ckey < row_ptr[i+1]; ckey++) {
        z[i] += data[ckey] * x[colind[ckey]];
    }
}

Is this the loop you are trying to vectorize? I don’t see any ‘if’ inside the innermost loop.

But anyway, here the vectorizer might run into the following problems:

  1. The iteration count of the innermost loop is unknown.
  2. Gather accesses ( a[b[i]] ). With the AVX-512 instruction set it’s possible to generate efficient code for such a case, but a) I think it’s not supported yet, and b) if that ISA isn’t available, the vectorized code would need to ‘manually’ gather scalar values into a vector, which might be slow (and thus the vectorizer might decide to leave the code scalar).

And here is a list of papers vectorizer is based on:

// The reduction-variable vectorization is based on the paper:
// D. Nuzman and R. Henderson. Multi-platform Auto-vectorization.
//
// Variable uniformity checks are inspired by:
// Karrenberg, R. and Hack, S. Whole Function Vectorization.
//
// The interleaved access vectorization is based on the paper:
// Dorit Nuzman, Ira Rosen and Ayal Zaks. Auto-Vectorization of Interleaved
// Data for SIMD
//
// Other ideas/concepts are from:
// A. Zaks and D. Nuzman. Autovectorization in GCC-two years later.
//
// S. Maleki, Y. Gao, M. Garzaran, T. Wong and D. Padua. An Evaluation of
// Vectorizing Compilers.

And probably, some of the parts are written from scratch with no reference to a paper.

The presentations you found are a good starting point - they’re still good for getting the basics of the vectorizer - but they are a bit outdated now, in the sense that a lot of new features have been added since then (and bugs fixed :) ). Also, I’d recommend trying a newer LLVM version - I don’t think it’ll handle the example above, but it would be much more convenient to investigate why the loop isn’t vectorized, and to fix the vectorizer if we figure out how.

Best regards,
Michael

Resending to new llvm-dev…

Hello, Michael.
     I come back to this older email.

     I am trying to implement coalescing/collapsing of nested loops. This would clearly be beneficial for the loop vectorizer as well.
     I am planning to start by modifying the LLVM loop vectorizer to add loop coalescing at the LLVM IR level.

     Are you aware of a similar effort on loop coalescing in LLVM (maybe even a different LLVM pass, not related to the LLVM loop vectorizer)?

   Thank you,
     Alex


Hi,

What is the performance gain you expect? Can you share the modified source(s) or IR? It might be good to have such 'test cases’ that can show the potential performance gains, e.g. as a separate test suite. And this might be a good start.

Since you are looking at a loop transformation you might find some inspiration from Adam’s loop distribution pass http://reviews.llvm.org/D8831

Good luck!

Gerolf

Hi Alex,

I’m not aware of efforts on loop coalescing in LLVM, but Polly can probably do something like this. Also, one related thought: it might be worth making it a separate pass, not part of the loop vectorizer. LLVM already has several ‘utility’ passes (e.g. loop rotation), which primarily aim at enabling other passes.

Thanks,
Michael

Hello.
     Mikhail, I come back to this older thread.
     I need to make a few changes to LoopVectorize.cpp.

     One of them is related to figuring out the exact C source line and column number of the loops being vectorized. I've noticed that a recent version of LoopVectorize.cpp prints imprecise debug info for vectorized loops - for example, the location of a character of an assignment statement inside the respective loop.
     It would help me a lot in my project to find the exact C source line and column number of the first and last character of the loop being vectorized (an imprecise location would make my life more complicated).
     Is this feasible? Or are there limitations at the level of clang in retrieving the exact C source line and column number of the beginning and end of a loop (which can include indent characters before and after the loop)?
     (I've seen other examples of imprecise locations, such as in the "Reading diagnostics" chapter of the book with ISBN 1782166939.)

     Note: to be able to retrieve the debug info from the C source file, we have to run clang with the -Rpass* options, as discussed before. Otherwise, if we run clang first and then opt (which runs LoopVectorize) on the resulting .ll file, we lose the C source file debug info (the DebugLoc class, etc.) and obtain debug info relative to the .ll file instead. An example:
         clang -O3 3better.c -arch=mips -ffast-math -Rpass=debug -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -S -emit-llvm -fvectorize -mllvm -debug -mllvm -force-vector-width=16 -save-temps

   Thank you,
     Alex


> Is it the loop you are trying to vectorize? I don’t see any ‘if’ inside the
> innermost loop.

I tried to simplify this code in the hope the loop vectorizer can take care of it
better: I linearized...

> And here is a list of papers vectorizer is based on: [...]

Thanks for the papers - these appear to be written in the header of the file
implementing the loop vectorization transformation (found at
"where-you-want-llvm-to-live"/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp).

Hi Alex,

I think the changes you want are actually not vectorizer-related. The vectorizer just uses data provided by other passes.

What you probably want is to look into the routine Loop::getStartLoc() (see lib/Analysis/LoopInfo.cpp). If you find a way to improve it, patches are welcome :)

Thanks,
Michael

Try using:

#pragma clang loop vectorize(enable) interleave(enable)

just before the loop. It should force vectorization even if the cost
model doesn't like it.

This is not a final solution, but it may give you a hint at what the
vectorizer thought it could do, and why it wasn't beneficial.

Try running benchmarks with the pragma on and off and see if it really wasn't.

If the vectorized code is better, the cost model was wrong, and we
should update it.

If the vectorized code is worse, and you know of a better way, please
file a bug in Bugzilla [1] so that we can find out what the vectorizer can
do to make it work.

Of course, if you can change the vectorizer yourself and submit
patches, that'd be even better. :)

Please, provide the bug (or the patch proposals) with as much
information on costs and architecture as you can.

cheers,
--renato

[1] https://llvm.org/bugs/

Hi Alex,

This has been very recently fixed by Hal. See http://reviews.llvm.org/rL270771

Adam

Hello, Mikhail.
     I'm planning to do a source-to-source transformation for loop vectorization.
     Basically, I want to generate C (C++) code from C (C++) source code:
       - the code that is not vectorized remains the same - this would be simple to achieve if we can obtain the source location of each statement precisely;
       - for the code that gets vectorized, I want to translate the sequential parts into C code and generate SIMD intrinsics for my SIMD processor where the vectorizer would normally generate vector instructions.
      I started looking at InnerLoopVectorizer::vectorize() and InnerLoopVectorizer::createEmptyLoop(). Generating C/C++ code (with the help of LLVM intrinsics) instead of LLVM IR is not trivial, but it should be reasonably simple to achieve.

     Would you advise such an approach as the one described above? I guess doing this as a Clang phase (working on the source code) is not really a bad idea either, since I would have better control over the source code, but I would need to reimplement the loop vectorizer algorithm that is currently implemented on LLVM IR.

   Thank you,
     Alex


Vectorization is a coordination between high-level optimizations (loop-level
transformations) and low-level target-specific work. If you are still at the
source level, how do you plan to handle the actual lowering? In that case
you'll always be at the mercy of another piece, which may or may not be able
to handle what you've done. (In theory your transformation could be correct,
but the backend might just not handle it.)

Having said this - why not actually work on fixing the root of the
"problem", that being the actual LLVM passes which aren't doing what you
need? This would also likely be more robust, and you could maintain control
over the whole experiment (compilation flow).

I get really annoyed when reviewing papers from academics who have used
source-to-source because they thought it was "easier". Short-term shortcuts
aren't likely to produce novel results.

Some related work: http://llvm.org/devmtg/2013-04/krzikalla-slides.pdf

Hi Alex,

Hello, Mikhail.
   I'm planning to do source-to-source transformation for loop vectorization.

Could you please share your reasoning on why you need to do it source-to-source? While I recognize that there might be external reasons to do it, I do think that working on IR is much easier.

   Basically I want to generate C (C++) code from C (C++) source code:
     - the code that is not vectorized remains the same - this would be simple to achieve if we can obtain precisely the source location of each statement;

If you work completely in the front-end, without generating IR, then yes, it's probably true. But the most complicated part, though, would be to check whether vectorization is legal. Even in IR it's not a trivial task - if you want the same level of robustness as we have now, I'm afraid you'll end up with just another IR internal to your transformation. For one, think about how you would handle memory aliasing.

If you do lower to IR first, then there is no "the code that is not vectorized remains the same" - it's already mutated by previous passes anyway. E.g. what if the loop was distributed/unrolled before vectorization?

     - the code that gets vectorized I want to translate in C code the parts that are sequential and generate SIMD intrinsics for my SIMD processor where normally it would generate vector instructions.
    I started looking at InnerLoopVectorizer::vectorize() and InnerLoopVectorizer::createEmptyLoop(). Not generating LLVM code but C/C++ code (with the help of LLVM intrinsics) is not trivial, but it should be reasonably simple to achieve.

What you suggest here is like writing a C backend and teaching it to generate intrinsics for vector code (such a backend existed some time ago, btw). It should be doable, but I wouldn't call it simple :)

Thanks,
Michael

Hi, Mikhail.
     Please see answers embedded in the text below.

Hi Alex,

Hello, Mikhail. I'm planning to do source-to-source transformation for loop
vectorization.

Could you please share your reasoning on why you need to do it source-to-source? While
I recognize that there might be external reasons to do it, I do think that working on
IR is much easier.

     I'm currently implementing a back end for BPF (Berkeley Packet Filter) + the Connex SIMD processor (using the register allocator, codegen, etc.). Note that Connex is a research processor.
     Part of this work required us to also work with LoopVectorize.cpp, since we need to add memory transfers at vector loads and stores, specific branch instructions, etc.
     But, in principle, we also wish to generate readable C/C++ code rather than assembly, if possible, since we have developed an assembly library written in C++. The most important reason is that it allows me to generate a C++ program for both the CPU (which can be BPF, but also ARM, x86, etc.) and the Connex SIMD processor.

Basically I want to generate C (C++) code from C (C++) source code: - the code that
is not vectorized remains the same - this would be simple to achieve if we can obtain
precisely the source location of each statement;

If you work completely in front-end, without generating IR, then yes, it's probably
true. But the most complicated part though would be to check if vectorization is

     There is a project performing vectorization in the front-end, Scout: https://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/scout/publications . It performs loop unroll-and-jam for strip-mining, loop collapsing (http://www.t-systems-sfr.com/e/downloads/2010/vortraege/1Krzikalla.pdf), and what they call if-collection and register blocking.

legal. Even in IR it's not a trivial task - if you want the same level of
error-proofness as we have now, I'm afraid you'll end with just another IR internal to
your transformation.

     Indeed, vectorization in the front-end requires implementing quite a few loop transformations, which are normally already available in LLVM IR. But they should be simpler to implement since they target just vectorization.

For one, think about how would you handle memory aliasing.

If you do lower to IR first, then there is no "the code that is not vectorized remains
the same" - it's already mutated by previous passes anyway. E.g. what if the loop was
distributed/unrolled before vectorization?

     This is an interesting detail I hadn't considered.
     If the loop is unrolled in order to basically perform strip-mining (and reduce control-flow instruction intensity, as I can see in the code generated by LoopVectorize.cpp), then I guess I could still have a chance of doing what I've written above, since unrolling doesn't change the rest of the code.
     However, for loop distribution (fission) the situation is more complicated - I need to better understand your LoopVectorize.cpp - but I understand that a loop can contain several patterns to be vectorized independently, and possibly also some parts that are not vectorizable (e.g. computing Fibonacci numbers). Something like:
     for (i = 2; i < N; i++) {
         c[i] = a[i] + b[i];
         reduct[i] += a[i];
         fib[i] = fib[i - 1] + fib[i - 2];
     }
     Clearly, this loop can benefit from loop fission. I've checked this piece of code with LLVM and it does NOT get vectorized, because of the fib computation and because LoopVectorize.cpp does not perform loop fission. But if we take out the fib computation, it does get vectorized.
     So, I could still experiment carefully with C/C++ code generation from LoopVectorize.cpp while I look for a better solution.

     What other transformations would be beneficial for vectorization? I guess loop blocking, software pipelining, loop interchange, etc.

- the code that gets vectorized I want to translate in C code the parts that are
sequential and generate SIMD intrinsics for my SIMD processor where normally it would
generate vector instructions. I started looking at InnerLoopVectorizer::vectorize()
and InnerLoopVectorizer::createEmptyLoop(). Not generating LLVM code but C/C++ code
(with the help of LLVM intrinsics) is not trivial, but it should be reasonably simple
to achieve.

What you suggest here is like writing a C backend and teach it to generate intrinsics
for vector code (such backend existed some time ago btw). It should be doable, but I
wouldn't call it simple:)

     I thought of using a C back end for LLVM (such as https://github.com/JuliaComputing/llvm-cbe or https://github.com/draperlaboratory/llvm-cbe), but I did not do it because I thought it was no longer maintained - though actually it seems it is.

   Thank you,
      Alex

Thanks, Michael

Would you advise for such an operation as the one described above? I guess doing
this as a Clang phase (working on the source code) is not really a bad idea either,
since I would have better control on source code, but I would need to reimplement the
loop vectorizer algorithm that is currently implemented on LLVM code.

Thank you, Alex

Hi Alex,

I think the changes you want are actually not vectorizer related. Vectorizer just
uses data provided by other passes.

What you probably might want is to look into routine Loop::getStartLoc() (see
lib/Analysis/LoopInfo.cpp). If you find a way to improve it, patches are welcome:)

Thanks, Michael

Hello. Mikhail, I come back to this older thread. I need to do a few changes to
LoopVectorize.cpp.

One of them is related to figuring out the exact C source line and column number
of the loops being vectorized. I've noticed that a recent version of
LoopVectorize.cpp prints imprecise debug info for vectorized loops such as, for
example, the location of a character of an assignment statement inside the
respective loop. It would help me a lot in my project to find the exact C source
line and column number of the first and last character of the loop being
vectorized. (imprecise location would make my life more complicated). Is this
feasible? Or are there limitations at the level of clang of retrieving the exact
C source line and column number location of the beginning and end of a loop (it
can include indent chars before and after the loop)? (I've seen other examples
with imprecise location such as the "Reading diagnostics" chapter in the book
isbn:1782166939 - Căutare Google .)

Note: to be able to retrieve the debug info from the C source file we require to
run clang with -Rpass* options, as discussed before. Otherwise, if we run clang
first, then opt on the resulting .ll file which runs LoopVectorize, we lose the C
source file debug info (DebugLoc class, etc) and obtain the debug info from the
.ll file. An example: clang -O3 3better.c -arch=mips -ffast-math -Rpass=debug
-Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -S -emit-llvm -fvectorize
-mllvm -debug -mllvm -force-vector-width=16 -save-temps

Thank you, Alex

Hi Alex,

I'm not aware of efforts on loop coalescing in LLVM, but probably polly can do
something like this. Also, one related thought: it might be worth making it a
separate pass, not a part of loop vectorizer. LLVM already has several
'utility' passes (e.g. loop rotation), which primarily aims at enabling other
passes.

Thanks, Michael

Hello, Michael. I come back to this older email. Sorry if you receive it
again.

I am trying to implement coalescing/collapsing of nested loops. This would
be clearly beneficial for the loop vectorizer, also. I'm normally planning to
start modifying the LLVM loop vectorizer to add loop coalescing of the LLVM
language.

Are you aware of a similar effort on loop coalescing in LLVM (maybe even a
different LLVM pass, not related to the LLVM loop vectorizer)?

Thank you, Alex

With best regards, Alex Susu

Hi Alex,

Example from the link you provided looks like this:

>for (i=0; i<M; i++ ){ z[i]=0; for (ckey=row_ptr[i];
ckey<row_ptr[i+1]; ckey++) { z[i] += data[ckey]*x[colind[ckey]]; } }|

Is it the loop you are trying to vectorize? I don’t see any ‘if’ inside
the innermost loop.

I tried to simplify this code in the hope that the loop vectorizer could handle
it better: I linearized...

But anyway, here the vectorizer might have the following troubles:
  1) The iteration count of the innermost loop is unknown.
  2) Gather accesses ( a[b[i]] ). With the AVX512 instruction set it is possible
to generate efficient code for such a case, but a) I think it is not supported
yet, and b) if this ISA isn't available, then the vectorized code would need to
'manually' gather scalar values into a vector, which might be slow (and thus the
vectorizer might decide to leave the code scalar).
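To illustrate point 2), here is a sketch (my own, with a made-up vector factor of 4) of what 'manually' gathering a[b[i]] looks like when no hardware gather instruction is available - one scalar load per lane:

```cpp
#include <cassert>

// Hypothetical vector factor; a real vectorizer chooses it per target.
const int VF = 4;

// Emulated gather for x[colind[ckey]]-style accesses: the indices can be
// loaded as a vector, but each indexed element of 'x' must be fetched with
// a separate scalar load and inserted into its lane.
void gather_lanes(const double *x, const int *colind, double *lanes) {
    for (int lane = 0; lane < VF; ++lane)
        lanes[lane] = x[colind[lane]];  // scalar load + insert per lane
}
```

This per-lane sequence is the overhead the vectorizer's cost model weighs against the benefit of vectorizing the rest of the loop body.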

And here is the list of papers the vectorizer is based on:
  - The reduction-variable vectorization is based on the paper: D. Nuzman and
R. Henderson. Multi-platform Auto-vectorization.
  - Variable uniformity checks are inspired by: R. Karrenberg and S. Hack.
Whole-Function Vectorization.
  - The interleaved access vectorization is based on the paper: Dorit Nuzman,
Ira Rosen and Ayal Zaks. Auto-Vectorization of Interleaved Data for SIMD.
  - Other ideas/concepts are from: A. Zaks and D. Nuzman. Autovectorization in
GCC - two years later; and S. Maleki, Y. Gao, M. Garzaran, T. Wong and D. Padua.
An Evaluation of Vectorizing Compilers.
And probably some of the parts are written from scratch with no reference to a
paper.

The presentations you found are a good starting point - they are still good for
getting the basics of the vectorizer, but they are a bit outdated now in the
sense that a lot of new features have been added since then (and bugs fixed :) ).
Also, I'd recommend trying a newer LLVM version - I don't think it will handle
the example above, but it would be much more convenient to investigate why the
loop isn't vectorized, and to fix the vectorizer if we figure out how.

Best regards, Michael

Thanks for the papers - these appear to be listed in the header of the file
implementing the loop vectorization transformation (found at
"where-you-want-llvm-to-live"/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp).

Hello. I am trying to vectorize a CSR SpMV (sparse matrix vector
multiplication) procedure, but the LLVM loop vectorizer is not able to handle
such code. I am using clang and LLVM version 3.4 (on Ubuntu 12.10). I use the
-fvectorize option with clang and -loop-vectorize with opt-3.4. The CSR SpMV
function is inspired by "c - slow sparse matrix vector product (CSR) using
open mp - Stack Overflow" (I can provide the exact code samples used).

Basically, the problem is that the loop vectorizer does NOT work with an if
inside the loop (be it 2 nested loops, or a modification of SpMV I did with
just 1 loop - I can provide the exact code) changing the value of the
accumulator z. I can sort of understand why LLVM isn't able to vectorize the
code. However, at http://llvm.org/docs/Vectorizers.html#if-conversion it is
written:
<<The Loop Vectorizer is able to "flatten" the IF statement in the code and
generate a single stream of instructions. The Loop Vectorizer supports any
control flow in the innermost loop. The innermost loop may contain complex
nesting of IFs, ELSEs and even GOTOs.>>
Could you please tell me what these lines are exactly trying to say?
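For what it's worth, my reading of those lines, illustrated with a made-up example (not the SpMV code): the vectorizer can replace a branch in the loop body by computing both sides and blending the results with a select, so the body becomes one straight-line instruction stream that can then be vectorized:

```cpp
#include <cassert>

// Branchy form: an if inside the innermost loop updates the accumulator.
int sum_positive(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        if (a[i] > 0)
            s += a[i];
    return s;
}

// If-converted ("flattened") form: the same semantics with no control flow
// in the body; the conditional becomes a select, which maps to a vector
// blend/mask instruction.
int sum_positive_flat(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += (a[i] > 0) ? a[i] : 0;   // select instead of branch
    return s;
}
```

Both functions return the same value for any input; the second is simply the shape the vectorizer tries to produce internally before widening the loop.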

Could you please tell me what algorithm the LLVM loop vectorizer is using
(maybe the algorithm is described in a paper)? I have currently found only 2
presentations on this topic:
http://llvm.org/devmtg/2013-11/slides/Rotem-Vectorization.pdf and
https://archive.fosdem.org/2014/schedule/event/llvmautovec/attachments/audio/321/export/events/attachments/llvmautovec/audio/321/AutoVectorizationLLVM.pdf .

Thank you very much, Alex
_______________________________________________
LLVM Developers mailing list
LLVMdev@cs.uiuc.edu    http://llvm.cs.uiuc.edu

Hello.
     Christopher, please see answers below.

   Hello, Mikhail.
     I'm planning to do source-to-source transformation for loop
vectorization.
     Basically I want to generate C (C++) code from C (C++) source code:
       - the code that is not vectorized remains the same - this would be
simple to achieve if we can obtain precisely the source location of each
statement;
       - for the code that gets vectorized, I want to translate into C code the
parts that are sequential, and generate SIMD intrinsics for my SIMD processor
where the vectorizer would normally generate vector instructions.
      I started looking at InnerLoopVectorizer::vectorize() and
InnerLoopVectorizer::createEmptyLoop(). Generating C/C++ code (with the help of
LLVM intrinsics) instead of LLVM code is not trivial, but it should be
reasonably simple to achieve.

     Would you advise such an operation as the one described above? I guess
doing this as a Clang phase (working on the source code) is not really a bad
idea either, since I would have better control over the source code, but I
would need to reimplement the loop vectorizer algorithm that is currently
implemented on LLVM IR.

Vectorization is a coordination between high-level optimizations (loop-level
stuff) and low-level target stuff. If you are still at the source level, how do
you plan to handle the actual lowering?

     LoopVectorize.cpp has nothing to do with lowering, as far as I know.
     Vectorization was shown to work as a source-to-source transformation pass in the Scout project (https://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/scout/publications). In their case the generated code is the source code somewhat transformed and augmented with x86 intrinsics (they have, probably, implemented vector data types directly in the AST).
   But one could go further: we could have C code with vector data types (for example the OpenCL kernel language) and compile this code with an OpenCL compiler.
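As a small sketch of the "C code with vector data types" idea - using the GCC/Clang vector_size extension rather than OpenCL, and with a type name (v4df) of my own choosing:

```cpp
#include <cassert>

// Four doubles processed as one value; arithmetic on the type is
// elementwise, and the compiler lowers it to SIMD instructions where the
// target has them (falling back to scalar code otherwise).
typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

v4df axpy4(double alpha, v4df x, v4df y) {
    v4df a = {alpha, alpha, alpha, alpha};  // broadcast the scalar
    return a * x + y;                       // elementwise multiply-add
}
```

The appeal for source-to-source work is that the output stays valid C while the lowering and register allocation remain the compiler's job.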

In that case you'll still always be at the mercy of another piece, which may or
may not be able to handle what you've done. (In theory your transformation
could be correct, but the backend might just not handle it.)

Having said this - why not actually work on fixing the root of the "problem",
that being the actual LLVM passes which aren't doing what you need? This would
also likely be more robust, and you can maintain control over the whole
experiment (compilation flow).

     Indeed, it seems that working on LoopVectorize.cpp is not the best idea (Mikhail noted that loop transformations like loop fission, currently not implemented, can normally disallow doing a source-to-source transformation from LoopVectorize.cpp), but it seems to be OK for the moment.
     But I also need to do instruction selection for the SIMD/vector unit, and it is best to let the LLVM back end do this. The Scout project does instruction selection in the ~frontend, and I guess this could be suboptimal since it does not use LLVM's register allocator, etc.
     Actually, any thought on this aspect is welcome (or, similarly put: how do x86 intrinsics do register allocation? - see Intel Developer Zone and "An Introduction to GCC Compiler Intrinsics in Vector Processing", Linux Journal).

I get really annoyed when reviewing papers from academics who have used
source-to-source because they thought it was "easier". Short-term shortcuts
aren't likely to produce novel results.

     Although I haven't worked much on source-to-source transformation, it seems to allow easier optimization of data structures than working with LLVM IR.
     But deciding on the right place in the compilation flow to implement such a transformation pass well seems to be a rather difficult decision.

   Thank you,
     Alex

Hello.
      Hal, Adam, thank you very much for the fix mentioned. I ran an opt built with this fix and I got the precise start loop location.
      I am interested in getting both exact start and end locations for a loop in order to replace the loop with a different content in the source file (basically perform a rather non-standard source-to-source transformation).

     I've tried to compute the end location for the loop by "parsing" the file (looking at each character) at least from the start location, but this can be quite complex for nested blocks in the loop, etc.

     Also, I've tried to get more information from the LLVM IR instructions:
        - Loop::getUniqueExitBlock()->front().getDebugLoc() returns the location of the first statement after the loop. But, in the case of a nested loop, the first (and last) statement after the loop is the "increment" statement of the outer enclosing loop. So, even if getUniqueExitBlock() etc. looks promising for simple loops, this is still not great.
        - I also iterated through all the statements of all basic blocks of the loop (using Loop::block_begin() and block_end(), etc.). From these I can choose the min and max locations. This is not great either, because a loop can contain comments before its final "}" (if there is one), which would result in an imprecise end location - and, most importantly, the "}" of the loop basically does not have a corresponding LLVM IR instruction. Of course, "parsing" to the right of the max location found above for an uncommented "}" is not very difficult.
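For reference, a rough sketch of that block-scanning approach (this compiles only against the LLVM C++ headers, and the exact DebugLoc validity check differs between LLVM versions, so treat the names here as approximate rather than exact API):

```cpp
#include <algorithm>
#include <utility>
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/DebugLoc.h"
#include "llvm/IR/Instruction.h"

// Scan every instruction in the loop's basic blocks and keep the min/max
// source lines. As discussed above, the loop's closing '}' has no IR
// instruction, so maxLine still underestimates the true end of the loop.
std::pair<unsigned, unsigned> approxLoopLineRange(const llvm::Loop *L) {
  unsigned minLine = ~0u, maxLine = 0;
  for (llvm::BasicBlock *BB : L->getBlocks())
    for (llvm::Instruction &I : *BB) {
      llvm::DebugLoc DL = I.getDebugLoc();
      if (!DL)                 // instruction carries no debug info
        continue;
      minLine = std::min(minLine, DL.getLine());
      maxLine = std::max(maxLine, DL.getLine());
    }
  return std::make_pair(minLine, maxLine);
}
```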

     I could also try to get more information from the AST of Clang while being in the opt tool, but I don't know how to read it - maybe I could use Libtooling. Do you have an idea here?
     Can I get the corresponding AST node from an LLVM IR instruction? (Here I got an interesting pointer: http://clang-developers.42468.n3.nabble.com/Matching-Clang-s-AST-nodes-to-the-LLVM-IR-instructions-they-produced-td3665037.html, but maybe it is outdated)

     Thank you,
       Alex

Hi Alex,

If you want to get both the starting and ending locations, I think your best bet is to enhance Clang to insert into the loop metadata, not just the location of the start of the loop, but also the location of the end of the loop. Then you can grab that in the backend.

What's your use case for this exactly?

-Hal

Hello.
     Hal, thank you very much - if I have to make my application 100% reliable I might enhance Clang as you suggested.

     As I've already suggested, I am interested in getting both exact start and end locations for a loop in order to replace it with a different loop with different content (using vector intrinsics) in the source file - so basically I want to perform a rather non-standard source-to-source transformation.

   Best regards,
     Alex


From: "Alex Susu" <alex.e.susu@gmail.com>
To: "llvm-dev" <llvm-dev@lists.llvm.org>
Cc: "Adam Nemet" <anemet@apple.com>, "Hal Finkel" <hfinkel@anl.gov>
Sent: Friday, August 12, 2016 4:43:27 AM
Subject: Re: [llvm-dev] [LLVMdev] LLVM loop vectorizer - start and end locations

Hi Alex,

This has been very recently fixed by Hal. See http://reviews.llvm.org/rL270771

Adam

Hello. Mikhail, I come back to this older thread. I need to do a few changes to
LoopVectorize.cpp.

One of them is related to figuring out the exact C source line and column
number of the loops being vectorized. I've noticed that a recent version of
LoopVectorize.cpp prints imprecise debug info for vectorized loops - for
example, the location of a character of an assignment statement inside the
respective loop. It would help me a lot in my project to find the exact C
source line and column number of the first and last character of the loop being
vectorized (an imprecise location would make my life more complicated). Is this
feasible? Or are there limitations at the level of clang in retrieving the
exact C source line and column number of the beginning and end of a loop (which
can include indent chars before and after the loop)? (I've seen other examples
with imprecise locations, such as the "Reading diagnostics" chapter in the book
isbn:1782166939.)

Note: to be able to retrieve the debug info from the C source file, we need to
run clang with the -Rpass* options, as discussed before. Otherwise, if we run
clang first and then run opt (which runs LoopVectorize) on the resulting .ll
file, we lose the C source file debug info (DebugLoc class, etc.) and obtain
the debug info relative to the .ll file instead. An example: clang -O3 3better.c
-arch=mips -ffast-math -Rpass=debug -Rpass=loop-vectorize
-Rpass-analysis=loop-vectorize -S -emit-llvm -fvectorize -mllvm -debug -mllvm
-force-vector-width=16 -save-temps

Thank you, Alex
