CFLAA

David_Callahan1 · August 25, 2016, 3:11pm

(Adding “LLVM Dev”)

My variant is up as https://reviews.llvm.org/D23876
—david

Daniel_Berlin1 · August 25, 2016, 4:06pm

Hey David,
I’ll take a look at the patch
Sounds like fun work.

As George says, improving AA significantly will almost always cause significant performance regressions at first, in almost any compiler.

Compilers knobs, passes, usually get tuned for x amount of freedom, and if you give them 10x, they start moving things too far, vectorizing too much, spilling, etc.

This was definitely the case for GCC, where adding a precise interprocedural field-sensitive analysis initially regressed performance by a few percent on average.

I know it was also the case for XLC at IBM, etc.

Like anything else, just gotta figure out what passes are going nuts, and rework them to have better heuristics/etc.
The end result is performance improvements, but the path takes a bit of time.

If you need a way to see whether your analysis has actually done an okay job in the meantime, usually a good way to see if you are doing well or not is to see how many loads/stores get eliminated or moved by various passes before and after.

If the number is significantly higher, great.
If the number is significantly lower, something has likely gone wrong

David_Callahan1 · August 25, 2016, 4:49pm

I did gathered aggregate statistics reported by “-stats” over the ~400 test files.

The following table summarizes the impact. The first column is the

sum where the new analysis is enabled, the second column is the

delta from baseline where no CFL alias analysis is performed. I am not

experienced enough to know which of these are “good” or “bad” indicators.

—david

72,250 685 SLP # vector instructions generated

1,256,401 566 adce # instructions removed

67,020,774 13,835,126 basicaa # times a GEP is decomposed

11,154 26 basicaa # times the limit to decompose GEPs is reached

153,613 324 bdce # instructions removed (unused)

198,495 2 bdce # instructions trivialized (dead bits)

298,621 0 cfl-od-aa Maximum compressed graph

58,462,719 0 cfl-od-aa Number Search Steps

48,401 0 cfl-od-aa # NoAlias results absed on address roots

61,936 0 cfl-od-aa # NoAlias results on compressed search path

3,768,131 0 cfl-od-aa # NoAlias results on fast path

47,016,909 0 cfl-od-aa # calls to query()

43,172,261 0 cfl-od-aa # instructions analyzed

10,515,257 0 cfl-od-aa # times there was no graph node for a value

9,895,755 0 cfl-od-aa Total size of compressed graphs (edges)

2,797 2 correlated-value-propagation # comparisons propagated

66,515 -126 correlated-value-propagation # phis propagated

912 3 correlated-value-propagation # sdiv converted to udiv

13,527 501 dse # other instrs removed

40,973 416 dse # stores deleted

126 -2 early-cse # compare instructions CVP’d

1,824,703 -138 early-cse # instructions CSE’d

1,875,417 87 early-cse # instructions simplified or DCE’d

62,505 1 functionattrs # arguments marked nocapture

29,979 1 functionattrs # arguments marked readonly

42,648 37 globaldce # functions removed

40,498 10 globaldce # global variables removed

4,368 35 gvn # blocks merged

21,961 26 gvn # equalities propagated

29,434 45 gvn # instructions PRE’d

631,597 3,307 gvn # instructions deleted

217,618 494 gvn # instructions simplified

51,089 634 gvn # loads PRE’d

135,568 1,526 gvn # loads deleted

2,197 4 indvars # IV comparisons eliminated

826 8 indvars # congruent IVs eliminated

2,538 4 indvars # exit values replaced

1,856 1 indvars # loop exit tests replaced

5,740,738 8 inline # caller-callers analyzed

1,629,169 3 inline # functions deleted because all callers found

3,563,497 2 inline # functions inlined

10,879,125 86 inline-cost # call sites analyzed

34,766 5 instcombine # constant folds

3,979,078 2,004 instcombine # dead inst eliminated

6,323 2 instcombine # dead stores eliminated

1,522 4 instcombine # factorizations

254,146 66 instcombine # instructions sunk

10,427,131 1,749 instcombine # insts combined

57,943 -205 instcombine # reassociations

1,072 1 instsimplify # expansions

135,129 1 instsimplify # reassociations

121,777 246 instsimplify # redundant instructions removed

27,612 -12 jump-threading # branch blocks duplicated to eliminate phi

76,000 197 jump-threading # jumps threaded

4,991 8 jump-threading # terminators folded

869,838 1,370 lcssa # live out of a loop variables

345,329 433 licm # instructions hoisted out of loop

702 -27 licm # instructions sunk out of loop

19,520 192 licm # load insts hoisted or sunk

202 37 licm # memory locations promoted to registers

467,244 246 local # unreachable basic blocks removed

1,586 34 loop-delete # loops deleted

84 27 loop-idiom # memcpy’s formed from loop load+stores

752 7 loop-idiom # memset’s formed from loop stores

63,364 -8 loop-rotate # loops rotated

4,602 1 loop-simplify # nested loops split out

1,244,741 472 loop-simplify # pre-header or exit blocks inserted

2,847 2 loop-unroll # loops completely unrolled

9,668 -29 loop-unroll # loops unrolled (completely or otherwise)

5,799 -35 loop-unroll # loops unrolled with run-time trip counts

3,863 25 loop-unswitch # branches unswitched

1,054,060 1,482 loop-unswitch Total number of instructions analyzed

109,279 -3 loop-vectorize # loops analyzed for vectorization

526,766 -136 mem2reg # PHI nodes inserted

4,150,078 -3 mem2reg # alloca’s promoted with a single store

4,567 6 memcpyopt # memcpy instructions deleted

96 1 memcpyopt # memcpys converted to memset

1,074 173 memcpyopt # memmoves converted to memcpy

39,584 6 memcpyopt # memsets inferred

179,629 2,475 memdep # block queries that were completely cached

1,020 -3 memdep # cached, but dirty, non-local ptr responses

9,108,504 146,792 memdep # fully cached non-local ptr responses

11,678,674 92,225 memdep # uncached non-local ptr responses

399,802 1,832 memory-builtins # arguments with unsolved size and offset

10,844 -24,169 memory-builtins # load instructions with unsolved size and offset

188,181 54 reassociate # insts reassociated

87,009 -82 scalar-evolution # loops with predictable loop counts

402,724 71 scalar-evolution # loops without predictable loop counts

133,310 72 sccp # basic blocks unreachable

275,949 263 sccp # instructions removed

2,056,414 723 simplifycfg # blocks simplified

5,292 -36 simplifycfg # common instructions sunk down to the end block

15,110 1 simplifycfg # speculative executed instructions

43,068 -2 sroa Maximum number of uses of a partition

11,754,901 -180 sroa # alloca partition uses rewritten

4,623,115 -11 sroa # alloca partitions formed

5,927,727 -11 sroa # allocas analyzed for replacement

4,576,406 -5 sroa # allocas promoted to SSA values

13,770,636 -227 sroa # instructions deleted

3,797 -1 strip-dead-prototypes # dead prototypes removed

Daniel_Berlin1 · August 25, 2016, 4:54pm

Okay, dumb question:
Are you really getting negative numbers in the second column?

526,766 -136 mem2reg # PHI nodes inserted

http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8cpp_source.html

(Search for NumPHIInsert).

I don’t see how it could be negative unless this wrapped around?

Daniel_Berlin1 · August 25, 2016, 4:55pm

(and sys::cas_flag that STATISTIC uses is a uint32 …)

David_Callahan1 · August 25, 2016, 4:56pm

The second column is the difference between the new analysis and a baseline without the analysis.

David_Callahan1 · August 25, 2016, 5:17pm

Here is a summary of the experiment behind this patch
https://www.facebook.com/notes/david-callahan/llvm-cfl-alias-analysis/10150648030284971

Daniel_Berlin1 · August 25, 2016, 8:43pm

One of the things throwing me off is the cfl numbers being 0

Outside of one number, they could be deltas in either direction.

IE baseline is column A - column B, or baseline is column B - column A

However, given:

10,844 -24,169

I presume this means the baseline is 34k?

If so, it looks like you are promoting, hoisting, deleting, vectorizing etc a bunch more instructions. About 1-5, most things being 1%, a few being 5%

SROA seems to do very slightly worse, but it’s not clear if that’s because the instructions got deleted before it did anything.

Daniel_Berlin1 · August 25, 2016, 8:45pm

Note that this kind of “graph CSE” is the basis of HVN/HU/HRU in
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0ahUKEwjEupa6t93OAhUE1mMKHbsCDU4QFggeMAE&url=https%3A%2F%2Fwww.cs.ucsb.edu%2F~benh%2Fresearch%2Fpapers%2Fhardekopf07exploiting.pdf&usg=AFQjCNGNzQ6vgsfxWRMW3a5aA4fGGLADOQ&sig2=3mC64txCKq5daxQqMK7UIA&bvm=bv.130731782,d.cGc

You could do the same analysis on this graph.

Topic		Replies	Views
[GSoC 2016] Introduction & Feedback - Better Alias Analysis LLVM Dev List Archives	7	85	May 13, 2016
Testing the new CFL alias analysis LLVM Dev List Archives	8	145	September 20, 2014
[GSoC 2016] Proposal: CFL-AA by default LLVM Dev List Archives	3	98	March 25, 2016
CFLAA LLVM Dev List Archives	3	69	August 26, 2016
Comparison of Alias Analysis in LLVM LLVM Dev List Archives	8	84	January 5, 2012

CFLAA

Related Topics