About Clang/LLVM PGO

Thanks for trying out LLVM PGO and evaluating its performance.

We are currently still focused more on infrastructure improvements, which are the foundation for performance improvement. We are making good progress in this direction, but there are still some key missing pieces, such as profile data in the inliner. We are working on that. Once those are done, more focus will shift to making additional passes profile-aware and to improving the existing profile-aware passes (e.g., code layout).

I looked at this particular example. GCC's PGO can reduce the runtime by half, while LLVM's PGO makes no performance difference, as you noticed.

In the GCC case, PGO itself contributes about a 15% performance boost. The majority of the improvement comes from loop vectorization. Note that trunk GCC does not turn on vectorization at O2, but it does at O3, or at O2 with PGO.

LLVM also vectorizes the key loops. However, compared with GCC's vectorizer, LLVM's auto-vectorizer produces worse code (e.g., a long sequence of instructions to do sign extension): ~6.5 instr/iter vs. ~9 instr/iter. GCC also unrolls the loop after vectorization, which helps a little more. LLVM's vectorization actually hurts performance slightly.

We will look into this issue.

thanks,

David

Hi David,

Thanks for your great explanation, covering not only LLVM but also GCC! To understand the code-layout optimization better, I slightly changed my code: basically, I call the hot() function in the first if-branch instead of in the last else-branch (see my modified code below). This reduces the number of branch instructions executed and possibly improves branch-predictor performance. On my Mac, I got a ~6% performance improvement (clang++ -O2) from this change. Looking at the default.profraw data, I can see it contains the information the optimizer would need to make the same optimization my manual change did. I was hoping LLVM's PGO could do this automatically.

I am excited to hear from you that more infrastructure changes are underway that will improve PGO support. In the meantime, what is the list of PGO optimizations for which I can write some code and see an immediate improvement from LLVM? It would be great to know such details. :)

Best,

Jie

//main2.cpp: manual reordering of branches

#include <iostream>
#include <stdlib.h>

using namespace std;

long long hot() {
  long long x = 0;
  for (int i = 0; i < 1000; i++) {
    x += i ^ 2;  // '^' is bitwise XOR
  }
  return x;
}

long long cold() {
  long long y = 0;
  for (int i = 0; i < 1000; i++) {
    y += i ^ 2;
  }
  return y;
}

long long foo() {
  long long y = 0;
  for (int i = 0; i < 1000; i++) {
    y *= i ^ 2;
  }
  return y * 2;
}

long long bar() {
  long long y = 0;
  for (int i = 0; i < 1000; i++) {
    y *= i ^ 2;
  }
  return y * 3;
}

#define SIZE 10000000

int main() {
  int* a = (int *)calloc(SIZE, sizeof(int));

  a[100] = 1;

  long long sum = 0;

  for (int i = 0; i < SIZE; i++) {
    if (a[i] < 1) {          // hot: taken for every zero element
      sum += hot();
    } else if (a[i] == 1) {  // cold: taken exactly once
      sum += cold();
    } else if (a[i] < 1) {   // never taken
      sum += bar();
      sum += foo();
    }
  }

  cout << sum << endl;

  free(a);
  return 0;
}

Hi David,

Thanks for your great explanation, covering not only LLVM but also GCC! To understand the code-layout optimization better, I slightly changed my code: basically, I call the hot() function in the first if-branch instead of in the last else-branch (see my modified code below). This reduces the number of branch instructions executed and possibly improves branch-predictor performance. On my Mac, I got a ~6% performance improvement (clang++ -O2) from this change. Looking at the default.profraw data, I can see it contains the information the optimizer would need to make the same optimization my manual change did. I was hoping LLVM's PGO could do this automatically.

Yes, this is a missing profile-guided control-flow optimization: reducing the hot path's control-dependence height by branch reordering. It is possible when the branch conditions are mutually exclusive.

I am excited to hear from you that more infrastructure changes are underway that will improve PGO support. In the meantime, what is the list of PGO optimizations for which I can write some code and see an immediate improvement from LLVM? It would be great to know such details. :)

What I can tell you is that there are many missing ones that could benefit from profile data: profile-aware LICM (patch pending), speculative PRE, loop unrolling, loop peeling, auto-vectorization, inlining, function splitting, function layout, function outlining, profile-driven size optimization, induction-variable optimization/strength reduction, string-op specialization/optimization/inlining, switch peeling/lowering, etc. The biggest profile users today include the register allocator, BB layout, if-conversion, and shrink-wrapping, but there should be room for improvement there too.
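For reference, the instrumentation-based PGO workflow with Clang that produces the default.profraw mentioned above looks like this. This is a sketch using current Clang flag spellings; it uses a tiny stand-in program and skips itself if the tools are not installed:

```shell
#!/bin/sh
set -e
# Skip gracefully if clang++ / llvm-profdata are not on PATH.
command -v clang++ >/dev/null 2>&1 || { echo "clang++ not found; skipping"; exit 0; }
command -v llvm-profdata >/dev/null 2>&1 || { echo "llvm-profdata not found; skipping"; exit 0; }

# A tiny stand-in program (the real case would be main2.cpp).
cat > /tmp/pgo_demo.cpp <<'EOF'
#include <iostream>
int main() {
    long long s = 0;
    for (int i = 0; i < 1000; i++) s += i;
    std::cout << s << std::endl;
    return 0;
}
EOF

# 1. Build with instrumentation.
clang++ -O2 -fprofile-instr-generate /tmp/pgo_demo.cpp -o /tmp/pgo_demo

# 2. Run a representative workload; this writes the raw profile.
LLVM_PROFILE_FILE=/tmp/pgo_demo.profraw /tmp/pgo_demo

# 3. Merge raw profiles into the indexed form the compiler reads.
llvm-profdata merge -output=/tmp/pgo_demo.profdata /tmp/pgo_demo.profraw

# 4. Rebuild with the profile applied.
clang++ -O2 -fprofile-instr-use=/tmp/pgo_demo.profdata /tmp/pgo_demo.cpp -o /tmp/pgo_demo_pgo
/tmp/pgo_demo_pgo
```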

thanks,

David

Hi David,

Your input is very helpful! Now I have an idea of where the missing pieces are. I am looking forward to seeing more improvements in LLVM, especially in PGO.

Thanks again to you and your colleagues for your work on the LLVM project.

Jie