Buildbots timing out on full builds

Hi,

We've noticed that recently some of our bots (mostly
clang-cmake-armv7-a15 and clang-cmake-thumbv7-a15) started timing out
whenever someone commits a change to TableGen:

r303418: http://lab.llvm.org:8011/builders/clang-cmake-armv7-a15/builds/7268
r303346: http://lab.llvm.org:8011/builders/clang-cmake-armv7-a15/builds/7242
r303341: http://lab.llvm.org:8011/builders/clang-cmake-armv7-a15/builds/7239
r303259: http://lab.llvm.org:8011/builders/clang-cmake-armv7-a15/builds/7198

TableGen changes before that (I checked about 3-4 of them) don't have
this problem:
r303253: http://lab.llvm.org:8011/builders/clang-cmake-armv7-a15/builds/7197

That one in particular actually finishes the whole build in 635s,
which is only a bit over 50% of the timeout limit (1200s). So, between
r303253 and now, something happened that made full builds
significantly slower. Does anyone have any idea what that might have
been? Also, has anyone noticed this on other bots?

Thanks,
Diana

r303259 will have increased compile time, since it tripled the number of importable
SelectionDAG rules, but a quick measurement building the affected file:
    ninja lib/Target/<Target>/CMakeFiles/LLVM<Target>CodeGen.dir/<Target>InstructionSelector.cpp.o
for both ARM and AArch64 didn't show a significant increase. I'll check whether
it made a difference to linking.

sanitizer-x86_64-linux-fast also timed out after one of my commits this morning.

I don't think it's r303259. Starting with a fully built r303259, then updating to r303258 and running 'ninja' gives me:
  real 2m28.273s
  user 13m23.171s
  sys 0m47.725s
then updating to r303259 and running 'ninja' again gives me:
  real 2m19.052s
  user 13m38.802s
  sys 0m44.551s

Ok, thanks. I'll try to do a bisect next week to see if I can find it.

Cheers,
Diana

It must be r303341; I commented on the corresponding llvm-commits thread.

r303341 is the re-commit of r303259, which tripled the number of rules that can be imported into GlobalISel from SelectionDAG. A compile-time regression is to be expected, but when I looked into it I found it was ~25s on my machine for the whole incremental build, rather than the ~12mins you are seeing. I’ll take another look.

I’m aware of a couple of easy improvements we could make to the way the importer works. I was leaving them until we change it over to a state machine, but the most obvious is to group rules by their top-level gMIR instruction. This would reduce the cost of the std::sort that handles the rule priorities when generating the source file, and would also make the generated file simpler for the compiler to compile.
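
For illustration, the grouping could look something like this (Rule, TopLevelOpcode, and Priority are invented names for this sketch, not the actual emitter code):

    #include <algorithm>
    #include <map>
    #include <vector>

    // Invented stand-in for an importable SelectionDAG rule.
    struct Rule { unsigned TopLevelOpcode; unsigned Priority; };

    // Bucket rules by the opcode of their top-level gMIR instruction and
    // sort each bucket by priority. Sorting k buckets of ~n/k rules costs
    // about n*log(n/k) comparisons instead of n*log(n) for one global
    // std::sort, and the generated selector can dispatch on opcode first.
    std::map<unsigned, std::vector<Rule>>
    groupAndSortRules(const std::vector<Rule> &AllRules) {
      std::map<unsigned, std::vector<Rule>> RulesByOpcode;
      for (const Rule &R : AllRules)
        RulesByOpcode[R.TopLevelOpcode].push_back(R);
      for (auto &Bucket : RulesByOpcode)
        std::sort(Bucket.second.begin(), Bucket.second.end(),
                  [](const Rule &A, const Rule &B) {
                    return A.Priority < B.Priority;
                  });
      return RulesByOpcode;
    }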

Hi Daniel,

I did your experiment on a TK1 machine (same as the bots) and for r303258 I get:
  real 18m28.882s
  user 35m37.091s
  sys 0m44.726s

and for r303259:
  real 50m52.048s
  user 88m25.473s
  sys 0m46.548s

If I can help investigate, please let me know, otherwise we can just
try your fixes and see how they affect compilation time.

Thanks,
Diana

Is that with -fsanitize=memory too?

I'm currently building ToT with r303341 reverted. Once that's done I'll commit the revert and start investigating fixes.

Nope, no sanitizers.

Could you give https://reviews.llvm.org/differential/diff/99949/ a try? It brings back the reverted commit and fixes two significant compile-time issues. Assuming it works for you too, I’ll finish off the patches and post them individually.

The first one removes the single-use lambdas in the generated code. These turn out to be really expensive. Replacing them with equivalent gotos saves 11 million allocations (~57%) during the course of compiling AArch64InstructionSelector.cpp.o. The cumulative number of bytes allocated also drops by ~4GB (~36%).
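
A toy example of the shape of that change (not the actual generated code):

    // Toy example, not the actual generated code. Each single-use lambda
    // is another function the compiler has to run its optimization and
    // codegen passes over; the goto version is plain control flow inside
    // one function.
    bool selectWithLambda(unsigned Opcode) {
      auto Fail = [&]() {
        // restore the matcher state here
        return false;
      };
      if (Opcode != 42) // hypothetical opcode check
        return Fail();
      return true;
    }

    bool selectWithGoto(unsigned Opcode) {
      if (Opcode != 42) // hypothetical opcode check
        goto Fail;
      return true;
    Fail:
      // restore the matcher state here
      return false;
    }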

The second one is to split up the functions by the number of operands in the top-level instruction. This constrains the scale of the task the register allocator needs to deal with in X86InstructionSelector.cpp.o.
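
Roughly (the helper names here are invented):

    // Sketch with invented names: instead of one huge selectImpl() that
    // gives the register allocator an enormous function to chew on, emit
    // one helper per operand count and dispatch to it.
    #include "llvm/CodeGen/MachineInstr.h"
    using namespace llvm;

    bool selectImpl2Operands(MachineInstr &I); // hypothetical helpers,
    bool selectImpl3Operands(MachineInstr &I); // one per operand count

    bool selectImpl(MachineInstr &I) {
      switch (I.getNumOperands()) {
      case 2: return selectImpl2Operands(I);
      case 3: return selectImpl3Operands(I);
      default: return false;
      }
    }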

(this is outside my wheelhouse, so just as an aside): Could you explain further what aspect of the change was that saved allocations? Lambdas themselves don’t allocate memory (std::function of a stateful lambda may allocate memory - but I didn’t see any std::function in your change, though I might’ve missed it), so I’m guessing it’s something else/some other aspect of the code in/outside the lambdas and where it moved that changed the allocation pattern?

My tools don’t tell me which allocations no longer occur, but compiling the lambdas requires memory allocations. I believe the allocations come from the various optimization and analysis passes, SelectionDAG, MC, etc. for each of the MachineFunctions corresponding to the lambdas.

Ah, sorry, I see - compile time performance, not the performance of the generated code. Thanks for the correction/explanation!

Hi Daniel,

I built r303542, then applied your patch and built again, and it still takes:
  real 34m30.279s
  user 84m36.553s
  sys 0m58.372s

This is better than the 50m I saw before, but I think we should try to
make it a bit faster. Do you have any other ideas for speeding it up?

Thanks,
Diana

Thanks for trying that patch. I agree that 34 mins still isn't good enough but we're heading in the right direction.

Changing the partitioning predicate to the instruction opcode rather than the number of operands in the top-level instruction will hopefully cut it down further. I also have a patch that shaves a small amount off the compile time by replacing the various LLT::scalar()/LLT::vector() calls with references to LLT objects that were created in advance. I tried something similar with getRegBankForRegClass(), but I haven't written that up as a patch yet since it requires some refactoring to get access to a mapping that only RegisterBankEmitter.cpp knows about. In my experiment I edited this information into AArch64GenGlobalISel.inc by hand.
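
The LLT change is roughly this (a sketch only; the header path is the one LLVM currently uses):

    // Sketch of the LLT-caching idea: build the commonly used types once
    // and compare against the cached objects, instead of re-calling
    // LLT::scalar()/LLT::vector() inside every generated check.
    #include "llvm/Support/LowLevelTypeImpl.h"
    using namespace llvm;

    static const LLT S32 = LLT::scalar(32);
    static const LLT S64 = LLT::scalar(64);
    static const LLT V4S32 = LLT::vector(4, 32);

    // A generated type check then becomes a comparison with a pre-built
    // object rather than a constructor call:
    static bool isS32(LLT Ty) { return Ty == S32; }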

I think the real solution is to convert the generated C++ to the state-machine that we intended to end up with. I don't think we'll be able to put it off much longer given that we're hitting compile-time problems when we can only import 25% of the rules. That said, I have a couple more nearly-finished patches I'd like to get in before we introduce the state machine. Hopefully, the above tricks will be enough to save me a re-write.
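
To give a flavour of the direction (a toy interpreter with an invented encoding, not the design itself):

    // Toy table-driven matcher: the per-rule C++ becomes data, and one
    // small interpreter loop replaces it, so there's far less code for
    // the compiler to process.
    enum MatcherOpcode : unsigned { CheckOpcode, CheckNumOperands, Accept };

    struct MatchInput { unsigned Opcode, NumOperands; };

    static bool runMatchTable(const unsigned *Table, const MatchInput &In) {
      for (;;) {
        switch (*Table++) {
        case CheckOpcode:
          if (In.Opcode != *Table++)
            return false;
          break;
        case CheckNumOperands:
          if (In.NumOperands != *Table++)
            return false;
          break;
        case Accept:
          return true;
        default:
          return false; // malformed table
        }
      }
    }

    // A rule "top-level opcode 42 with 3 operands" would then just be:
    //   const unsigned Rule[] = {CheckOpcode, 42, CheckNumOperands, 3, Accept};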

Ok, that sounds reasonable. I'm happy to test more patches for you
when they're ready.

Hi Diana and Vitaly,

Could you give https://reviews.llvm.org/differential/diff/100829/ a try? When measuring the compile of AArch64InstructionSelector.cpp.o with asan enabled and running under Instruments’ Allocations profiler, my machine reports that cumulative memory allocation is down to ~3.5GB (was ~10GB), the number of allocations down to ~4 million (was ~23 million), and the compile time down to ~15s (was ~60s).

The patch is based on r303542 and the main change is that most of the generated C++ has been replaced with a state-machine based implementation. It’s not fully converted to a state machine yet, since it generates lots of smaller machines (one matcher and one emitter per rule) instead of a single machine, but it’s hopefully sufficient to unblock my patch series.

Hi,

This runs in:
  real 13m6.296s
  user 42m45.191s
  sys 1m2.030s

(on top of a fully built r303542). It should be fine for the ARM bots.

However, you need to ‘return std::move(M)’ at line 1884.
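
(I think that's the case where the returned local needs a conversion to the return type, so some compilers won't apply the implicit move; an illustration with invented types:)

    // Illustration (invented types): the local's type differs from the
    // function's return type, and some older compilers reject the
    // implicit move across that conversion, so a plain 'return M;'
    // fails to build.
    #include <memory>
    struct Base { virtual ~Base() {} };
    struct Derived : Base {};

    std::unique_ptr<Base> make() {
      std::unique_ptr<Derived> M(new Derived);
      return std::move(M); // explicit move needed here
    }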

@Vitaly, is it ok for your bots as well?

Cheers,
Diana

Great! I expect I’ll be able to cut it down further once I start fusing these smaller state-machines together. Before that, I’ll re-order the patches that went into that diff so that I don’t have to re-commit the regression before fixing it.

Is https://reviews.llvm.org/differential/diff/100829/ a replacement for r303341?

If so, LGTM.

r303542 msan AArch64InstructionSelector.cpp: 1m17.209s

r303542+diff/100829/ msan AArch64InstructionSelector.cpp: 1m24.724s