(Adding “LLVM Dev”)
My variant is up as https://reviews.llvm.org/D23876
—david
(Adding “LLVM Dev”)
My variant is up as https://reviews.llvm.org/D23876
—david
Hey David,
I’ll take a look at the patch
Sounds like fun work.
As George says, improving AA significantly will almost always cause significant performance regressions at first, in almost any compiler.
Compilers knobs, passes, usually get tuned for x amount of freedom, and if you give them 10x, they start moving things too far, vectorizing too much, spilling, etc.
This was definitely the case for GCC, where adding a precise interprocedural field-sensitive analysis initially regressed performance by a few percent on average.
I know it was also the case for XLC at IBM, etc.
Like anything else, just gotta figure out what passes are going nuts, and rework them to have better heuristics/etc.
The end result is performance improvements, but the path takes a bit of time.
If you need a way to see whether your analysis has actually done an okay job in the meantime, usually a good way to see if you are doing well or not is to see how many loads/stores get eliminated or moved by various passes before and after.
If the number is significantly higher, great.
If the number is significantly lower, something has likely gone wrong
I did gathered aggregate statistics reported by “-stats” over the ~400 test files.
The following table summarizes the impact. The first column is the
sum where the new analysis is enabled, the second column is the
delta from baseline where no CFL alias analysis is performed. I am not
experienced enough to know which of these are “good” or “bad” indicators.
—david
72,250 685 SLP # vector instructions generated
1,256,401 566 adce # instructions removed
67,020,774 13,835,126 basicaa # times a GEP is decomposed
11,154 26 basicaa # times the limit to decompose GEPs is reached
153,613 324 bdce # instructions removed (unused)
198,495 2 bdce # instructions trivialized (dead bits)
298,621 0 cfl-od-aa Maximum compressed graph
58,462,719 0 cfl-od-aa Number Search Steps
48,401 0 cfl-od-aa # NoAlias results absed on address roots
61,936 0 cfl-od-aa # NoAlias results on compressed search path
3,768,131 0 cfl-od-aa # NoAlias results on fast path
47,016,909 0 cfl-od-aa # calls to query()
43,172,261 0 cfl-od-aa # instructions analyzed
10,515,257 0 cfl-od-aa # times there was no graph node for a value
9,895,755 0 cfl-od-aa Total size of compressed graphs (edges)
2,797 2 correlated-value-propagation # comparisons propagated
66,515 -126 correlated-value-propagation # phis propagated
912 3 correlated-value-propagation # sdiv converted to udiv
13,527 501 dse # other instrs removed
40,973 416 dse # stores deleted
126 -2 early-cse # compare instructions CVP’d
1,824,703 -138 early-cse # instructions CSE’d
1,875,417 87 early-cse # instructions simplified or DCE’d
62,505 1 functionattrs # arguments marked nocapture
29,979 1 functionattrs # arguments marked readonly
42,648 37 globaldce # functions removed
40,498 10 globaldce # global variables removed
4,368 35 gvn # blocks merged
21,961 26 gvn # equalities propagated
29,434 45 gvn # instructions PRE’d
631,597 3,307 gvn # instructions deleted
217,618 494 gvn # instructions simplified
51,089 634 gvn # loads PRE’d
135,568 1,526 gvn # loads deleted
2,197 4 indvars # IV comparisons eliminated
826 8 indvars # congruent IVs eliminated
2,538 4 indvars # exit values replaced
1,856 1 indvars # loop exit tests replaced
5,740,738 8 inline # caller-callers analyzed
1,629,169 3 inline # functions deleted because all callers found
3,563,497 2 inline # functions inlined
10,879,125 86 inline-cost # call sites analyzed
34,766 5 instcombine # constant folds
3,979,078 2,004 instcombine # dead inst eliminated
6,323 2 instcombine # dead stores eliminated
1,522 4 instcombine # factorizations
254,146 66 instcombine # instructions sunk
10,427,131 1,749 instcombine # insts combined
57,943 -205 instcombine # reassociations
1,072 1 instsimplify # expansions
135,129 1 instsimplify # reassociations
121,777 246 instsimplify # redundant instructions removed
27,612 -12 jump-threading # branch blocks duplicated to eliminate phi
76,000 197 jump-threading # jumps threaded
4,991 8 jump-threading # terminators folded
869,838 1,370 lcssa # live out of a loop variables
345,329 433 licm # instructions hoisted out of loop
702 -27 licm # instructions sunk out of loop
19,520 192 licm # load insts hoisted or sunk
202 37 licm # memory locations promoted to registers
467,244 246 local # unreachable basic blocks removed
1,586 34 loop-delete # loops deleted
84 27 loop-idiom # memcpy’s formed from loop load+stores
752 7 loop-idiom # memset’s formed from loop stores
63,364 -8 loop-rotate # loops rotated
4,602 1 loop-simplify # nested loops split out
1,244,741 472 loop-simplify # pre-header or exit blocks inserted
2,847 2 loop-unroll # loops completely unrolled
9,668 -29 loop-unroll # loops unrolled (completely or otherwise)
5,799 -35 loop-unroll # loops unrolled with run-time trip counts
3,863 25 loop-unswitch # branches unswitched
1,054,060 1,482 loop-unswitch Total number of instructions analyzed
109,279 -3 loop-vectorize # loops analyzed for vectorization
526,766 -136 mem2reg # PHI nodes inserted
4,150,078 -3 mem2reg # alloca’s promoted with a single store
4,567 6 memcpyopt # memcpy instructions deleted
96 1 memcpyopt # memcpys converted to memset
1,074 173 memcpyopt # memmoves converted to memcpy
39,584 6 memcpyopt # memsets inferred
179,629 2,475 memdep # block queries that were completely cached
1,020 -3 memdep # cached, but dirty, non-local ptr responses
9,108,504 146,792 memdep # fully cached non-local ptr responses
11,678,674 92,225 memdep # uncached non-local ptr responses
399,802 1,832 memory-builtins # arguments with unsolved size and offset
10,844 -24,169 memory-builtins # load instructions with unsolved size and offset
188,181 54 reassociate # insts reassociated
87,009 -82 scalar-evolution # loops with predictable loop counts
402,724 71 scalar-evolution # loops without predictable loop counts
133,310 72 sccp # basic blocks unreachable
275,949 263 sccp # instructions removed
2,056,414 723 simplifycfg # blocks simplified
5,292 -36 simplifycfg # common instructions sunk down to the end block
15,110 1 simplifycfg # speculative executed instructions
43,068 -2 sroa Maximum number of uses of a partition
11,754,901 -180 sroa # alloca partition uses rewritten
4,623,115 -11 sroa # alloca partitions formed
5,927,727 -11 sroa # allocas analyzed for replacement
4,576,406 -5 sroa # allocas promoted to SSA values
13,770,636 -227 sroa # instructions deleted
3,797 -1 strip-dead-prototypes # dead prototypes removed
Okay, dumb question:
Are you really getting negative numbers in the second column?
526,766 -136 mem2reg # PHI nodes inserted
http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8cpp_source.html
(Search for NumPHIInsert).
I don’t see how it could be negative unless this wrapped around?
(and sys::cas_flag that STATISTIC uses is a uint32 …)
The second column is the difference between the new analysis and a baseline without the analysis.
Here is a summary of the experiment behind this patch
https://www.facebook.com/notes/david-callahan/llvm-cfl-alias-analysis/10150648030284971
One of the things throwing me off is the cfl numbers being 0
Outside of one number, they could be deltas in either direction.
IE baseline is column A - column B, or baseline is column B - column A
However, given:
10,844 -24,169
I presume this means the baseline is 34k?
If so, it looks like you are promoting, hoisting, deleting, vectorizing etc a bunch more instructions. About 1-5, most things being 1%, a few being 5%
SROA seems to do very slightly worse, but it’s not clear if that’s because the instructions got deleted before it did anything.
Note that this kind of “graph CSE” is the basis of HVN/HU/HRU in
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0ahUKEwjEupa6t93OAhUE1mMKHbsCDU4QFggeMAE&url=https%3A%2F%2Fwww.cs.ucsb.edu%2F~benh%2Fresearch%2Fpapers%2Fhardekopf07exploiting.pdf&usg=AFQjCNGNzQ6vgsfxWRMW3a5aA4fGGLADOQ&sig2=3mC64txCKq5daxQqMK7UIA&bvm=bv.130731782,d.cGc
You could do the same analysis on this graph.