selection dag speedups / llc speedups

Hello folks,

I’m sure this has been asked many times, but is there current work on decreasing the time taken by the DAG-based instruction selector, or the other phases of llc? I am just beginning to dive into LLVM, and I am interested in compile-time reductions that do not reduce code quality dramatically. For example, simply switching on “-fast-isel” (roughly 17% drop in code quality for some of my benchmarks) or “-regalloc=local/fast” (roughly 2x slower for some of my benchmarks) is, unfortunately, not an acceptable option.

I noticed that a semi-recent commit added the “STATISTIC(NumDAGIselRetries,“Number of times dag isel has to try another path”);” as well as a sorting change in DAGISelEmitter.cpp. This suggests to me that there is work being done. Are there other similar efforts, but perhaps aiming for bigger gains, that I should be aware of? From the other end, are there efforts to make the fast-path produce better quality code?

Some small things I’ve noticed:

  • LegalizeVectors might not make a single change over an entire input bc file, yet it can take about 6% of the ISel time. Identifying that this can be skipped ahead of time may help?
  • We can take the Changed flag returned by LegalizeVectors, and pass it on to Legalize so that it can decide not to AssignTopologicalOrder again, since it was already done in the LegalizeVectors pass. However, this is a negligible reduction: 0.2 seconds shaved off a 30+ second compile =).
  • BuildSchedGraph() can take around 30% of the X% of time taken by Instruction Scheduling. How different is the scheduler graph from the selection graph?
  • Just as a random guess that this would not affect code quality terribly, I tried dynamically switching on -fast-isel for basic blocks of size one. This covers about 10% of the basic blocks. The code-quality drop on a few benchmarks is negligible, but for SPEC’s 483.xalancbmk it is quite dramatic.
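The size-one heuristic from the last bullet could be sketched roughly as follows. All names here (`Selector`, `chooseSelector`) are hypothetical and are not the actual LLVM interfaces; this only illustrates the per-block decision being described:

```cpp
#include <cstddef>

// Hypothetical per-block selector choice: fall back to the fast path
// only for trivial blocks, where the code-quality impact should be
// small. None of these names correspond to real LLVM APIs.
enum class Selector { DAGIsel, FastIsel };

// Pick a selector based on block size. The size-one threshold mirrors
// the experiment described above; it could be made a tunable flag.
inline Selector chooseSelector(std::size_t numInstructionsInBlock) {
    return numInstructionsInBlock <= 1 ? Selector::FastIsel
                                       : Selector::DAGIsel;
}
```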

Also, any pointers to documentation, or a description of the algorithm used by the Select() → SelectCode() → SelectCodeCommon() path, would be very helpful.

Sorry for the long email, and many thanks in advance!

- Jan Voung

> Hello folks,
>
> I'm sure this has been asked many times, but is there current work on decreasing the time taken by the DAG-based instruction selector, or the other phases of llc? I am just beginning to dive into LLVM, and I am interested in compile-time reductions that do not reduce code quality dramatically. For example, simply switching on "-fast-isel" (roughly 17% drop in code quality for some of my benchmarks) or "-regalloc=local/fast" (roughly 2x slower for some of my benchmarks) is, unfortunately, not an acceptable option.
>
> I noticed that a semi-recent commit added the "STATISTIC(NumDAGIselRetries,"Number of times dag isel has to try another path");" as well as a sorting change in DAGISelEmitter.cpp. This suggests to me that there is work being done. Are there other similar efforts, but perhaps aiming for bigger gains, that I should be aware of? From the other end, are there efforts to make the fast-path produce better quality code?

The specific work you mention was aimed at reducing isel's own code
size. Previously, isel used a huge mountain of generated code which took
a long time to compile, and consumed a large amount of memory. The
new isel fixes these problems, but underneath it's using the same
conceptual algorithm, and I believe is about the same speed.

SelectionDAG performance is important. It hasn't seen a lot of
attention recently, as the "-O0" use case has been where most of
the compile-time attention has gone. But speeding up SelectionDAG
isel would also be useful for many different purposes.

Some small things I've noticed:

> - LegalizeVectors might not make a single change over an entire input bc file, yet it can take about 6% of the ISel time. Identifying that this can be skipped ahead of time may help?

Sounds reasonable. Perhaps there could be a SelectionDAG-wide flag which
indicates whether any vectors are present.

Another thing that could be done is avoiding running DAGCombine when
it isn't needed. I believe there are a few cases where the code just
reruns DAGCombine even if the preceding Legalize pass doesn't make
any changes.
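The two ideas above (a DAG-wide "has vectors" flag, and consulting a Changed flag before redoing work) amount to simple control-flow changes. Here is a toy model of that flow; `ToyDAG`, `legalizeVectors`, and `needsTopologicalResort` are illustrative stand-ins, not the real LLVM passes or data structures:

```cpp
// Toy model of the skip logic discussed above: record once, while the
// DAG is built, whether any vector-typed nodes exist, and let the
// legalization driver skip work accordingly. Not the real LLVM API.
struct ToyDAG {
    bool hasVectorNodes = false;  // would be set while building the DAG
};

// Returns true if the pass changed anything; skipping is free when no
// vector nodes were ever created.
inline bool legalizeVectors(const ToyDAG &dag) {
    if (!dag.hasVectorNodes)
        return false;  // nothing to legalize: skip the whole DAG walk
    // ... real vector legalization work would happen here ...
    return true;
}

// A later phase can consult the Changed flag instead of unconditionally
// re-sorting (the AssignTopologicalOrder savings mentioned below).
inline bool needsTopologicalResort(bool previousPhaseChangedDAG) {
    return previousPhaseChangedDAG;  // unchanged DAG keeps its order
}
```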

Longer term, I've also heard some ideas tossed around of merging
LegalizeDAG.cpp and now also LegalizeVectorOps.cpp with the new
LegalizeTypes infrastructure so that everything would run as a single
pass, which would greatly reduce the amount of "Walk and hash all the
nodes" work that these passes currently do.

Another thing to look at is the underlying FoldingSet data structure.
Right now, node lookup requires a lot of recomputation of hash data.

> - We can take the Changed flag returned by LegalizeVectors, and pass it on to Legalize so that it can decide not to AssignTopologicalOrder again, since it was already done in the LegalizeVectors pass. However, this is a negligible reduction: 0.2 seconds shaved off a 30+ second compile =).

Ok.

> - BuildSchedGraph() can take around 30% of the X% of time taken by Instruction Scheduling. How different is the scheduler graph from the selection graph?

The scheduling graph is constructed using the selection graph information,
but it does different things. Constants and other nodes which don't correspond
to instructions are excluded, nodes which must be scheduled together are
merged, extra edges can be added, the schedule graph doesn't do automatic
CSE on edge changes, etc. The schedule graph also has data to track
node height and depth, and other information used by the scheduler.
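The core of that derivation can be pictured with a toy filter: instruction-like nodes survive and gain scheduling metadata, everything else is dropped. `SelNode`, `SchedUnit`, and `buildSchedGraph` are illustrative names only; the real BuildSchedGraph() is far more involved (glue merging, extra edges, no automatic CSE, etc.):

```cpp
#include <vector>

// Toy illustration of deriving a scheduling graph from a selection
// graph: nodes that do not correspond to machine instructions (e.g.
// constants) are excluded, and each remaining unit carries extra
// scheduler-only data such as height and depth.
struct SelNode { bool isInstruction; };

struct SchedUnit {
    const SelNode *node;
    unsigned height = 0;   // extra data maintained by the scheduler
    unsigned depth = 0;
};

inline std::vector<SchedUnit>
buildSchedGraph(const std::vector<SelNode> &sel) {
    std::vector<SchedUnit> units;
    for (const SelNode &n : sel)
        if (n.isInstruction)          // constants etc. are excluded
            units.push_back(SchedUnit{&n});
    return units;
}
```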

Dan

The fast and local register allocators are meant to be used on unoptimized code, a 'Debug build'. While they do work on optimized code, they do not give good results. Their primary goal is compile time, not code quality.

/jakob

> The fast and local register allocators are meant to be used on unoptimized code, a 'Debug build'. While they do work on optimized code, they do not give good results. Their primary goal is compile time, not code quality.

Yes, we have a somewhat uncommon use case. It is fine to spend time
optimizing bitcode (LTO is OK), but we want to make the final IL ->
Executable translation as fast as possible.

/jakob

Cheers,

Do you know how the fast allocator performs in these conditions? Have you compared it to the local allocator? I really focused my efforts on unoptimized code.

/jakob

Here are some recent stats of the fast vs local vs linear scan at O0 on “opt -std-compile-opts” processed bitcode files. The fast regalloc is still certainly faster at codegen than local with such bitcode files. Let me know if the link doesn’t work:

https://spreadsheets.google.com/a/google.com/ccc?key=0At5EJFcCBf-wdDgtd2FoZjU4bFBzcFBtT25rQkgzMEE&hl=en

Misc stuff: I ran into an “UNREACHABLE executed” using linear scan on revision 104021, so I used an older version for that.

0 llc.hg 0x0000000000af4d7f
1 llc.hg 0x0000000000af54fa
2 libpthread.so.0 0x00007fb1734b67d0
3 libc.so.6 0x00007fb1725d2095 gsignal + 53
4 libc.so.6 0x00007fb1725d3af0 abort + 272
5 llc.hg 0x0000000000ad4932 llvm::llvm_unreachable_internal(char const*, char const*, unsigned int) + 370
6 llc.hg 0x0000000000886426 llvm::LiveIntervals::handleVirtualRegisterDef(llvm::MachineBasicBlock*, llvm::ilist_iterator&lt;llvm::MachineInstr&gt;, llvm::SlotIndex, llvm::MachineOperand&, unsigned int, llvm::LiveInterval&) + 3910
7 llc.hg 0x0000000000888429 llvm::LiveIntervals::handleRegisterDef(llvm::MachineBasicBlock*, llvm::ilist_iterator&lt;llvm::MachineInstr&gt;, llvm::SlotIndex, llvm::MachineOperand&, unsigned int) + 409
8 llc.hg 0x000000000088ade0 llvm::LiveIntervals::computeIntervals() + 2496
9 llc.hg 0x000000000088b56f llvm::LiveIntervals::runOnMachineFunction(llvm::MachineFunction&) + 447
10 llc.hg 0x00000000007b3493 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) + 115
11 llc.hg 0x0000000000a79ec0 llvm::FPPassManager::runOnFunction(llvm::Function&) + 688
12 llc.hg 0x0000000000a79f13 llvm::FPPassManager::runOnModule(llvm::Module&) + 67
13 llc.hg 0x0000000000a79a63 llvm::MPPassManager::runOnModule(llvm::Module&) + 515
14 llc.hg 0x0000000000a79b42 llvm::PassManagerImpl::run(llvm::Module&) + 114
15 llc.hg 0x0000000000a79bdd llvm::PassManager::run(llvm::Module&) + 13
16 llc.hg 0x00000000004d3112 main + 2802
17 libc.so.6 0x00007fb1725be1c4 __libc_start_main + 244
18 llc.hg 0x00000000004d0b09

Stack dump:
0. Program arguments: llc.hg -asm-verbose=false -O0 32/403.gcc/403.gcc.linked.bc -o 32/403.gcc/output/403.gcc.linked.bc.llc_O0.s

1. Running pass ‘Function Pass Manager’ on module ‘32/403.gcc/403.gcc.linked.bc’.
2. Running pass ‘Live Interval Analysis’ on function ‘@nonlocal_mentioned_p’.

- Jan

Jan Voung wrote:

> Here are some recent stats of the fast vs local vs linear scan at O0 on
> "opt -std-compile-opts" processed bitcode files. The fast regalloc is
> still certainly faster at codegen than local with such bitcode files.
> Let me know if the link doesn't work:
>
> https://spreadsheets.google.com/a/google.com/ccc?key=0At5EJFcCBf-wdDgtd2FoZjU4bFBzcFBtT25rQkgzMEE&hl=en

Hi Jan,

See the "/a/google.com/" part of the URL? Yeah, people without google.com logins can't read that.

There's no way to use docs to produce a document with your work account that is readable to the outside world, I've complained about it before. Export it instead.

Nick

> Here are some recent stats of the fast vs local vs linear scan at O0 on "opt -std-compile-opts" processed bitcode files. The fast regalloc is still certainly faster at codegen than local with such bitcode files. Let me know if the link doesn't work:
>
> https://spreadsheets.google.com/a/google.com/ccc?key=0At5EJFcCBf-wdDgtd2FoZjU4bFBzcFBtT25rQkgzMEE&hl=en

Sorry, can't read that.

> Misc stuff: I ran into an "UNREACHABLE executed" using linear scan on revision 104021, so I used an older version for that.

Please use bugpoint to reduce the test case and then submit a PR.

bugpoint -run-llc 403.gcc.linked.bc -tool-args -asm-verbose=false -O0

Thanks,

/jakob

Ah ok, thanks for the tip Nick. Here’s another try:

http://spreadsheets.google.com/ccc?key=0AjRrJHQc4_bddDhwWEhRajNDbWozZjk1YUVMWnZKVFE&hl=en

>> Misc stuff: I ran into an “UNREACHABLE executed” using linear scan on revision 104021, so I used an older version for that.
>
> Please use bugpoint to reduce the test case and then submit a PR.
>
> bugpoint -run-llc 403.gcc.linked.bc -tool-args -asm-verbose=false -O0

Ok.

- Jan