We've noticed that recently some of our bots (mostly
clang-cmake-armv7-a15 and clang-cmake-thumbv7-a15) started timing out
whenever someone commits a change to TableGen:
That one in particular actually finishes the whole build in 635s,
which is only a bit over 50% of the timeout limit (1200s). So, between
r303253 and now, something happened that made full builds
significantly slower. Does anyone have any idea what that might have
been? Also, has anyone noticed this on other bots?
r303259 will have increased compile time since it tripled the number of importable
SelectionDAG rules, but a quick measurement building the affected file:
ninja lib/Target/<Target>/CMakeFiles/LLVM<Target>CodeGen.dir/<Target>InstructionSelector.cpp.o
for both ARM and AArch64 didn't show a significant increase. I'll check whether
it made a difference to linking.
sanitizer-x86_64-linux-fast also timed out after one of my commits this morning.
I don't think it's r303259. Starting with a fully built r303259, then updating to r303258 and running 'ninja' gives me:
real 2m28.273s
user 13m23.171s
sys 0m47.725s
then updating to r303259 and running 'ninja' again gives me:
real 2m19.052s
user 13m38.802s
sys 0m44.551s
r303341 is the re-commit of r303259, which tripled the number of rules that can be imported into GlobalISel from SelectionDAG. A compile-time regression is to be expected, but when I looked into it I found it was ~25s on my machine for the whole incremental build rather than the ~12 mins you are seeing. I’ll take another look.
I’m aware of a couple of easy improvements we could make to the way the importer works. I was leaving them until we change it over to a state machine, but the most obvious is to group rules by their top-level gMIR instruction. This would reduce the cost of the std::sort that handles rule priorities when generating the source file, and it would also make the generated code simpler for the compiler to compile.
Could you give https://reviews.llvm.org/differential/diff/99949/ a try? It brings back the reverted commit and fixes two significant compile-time issues. Assuming it works for you too, I’ll finish off the patches and post them individually.
The first one removes the single-use lambdas in the generated code. These turn out to be really expensive. Replacing them with equivalent gotos saves 11 million allocations (~57%) during the course of compiling AArch64InstructionSelector.cpp.o. The cumulative number of bytes allocated also drops by ~4GB (~36%).
The second one is to split up the functions by the number of operands in the top-level instruction. This bounds the size of the problem the register allocator needs to deal with in X86InstructionSelector.cpp.o.
(This is outside my wheelhouse, so just as an aside.) Could you explain further what aspect of the change it was that saved allocations? Lambdas themselves don’t allocate memory (std::function of a stateful lambda may allocate, but I didn’t see any std::function in your change, though I might’ve missed it), so I’m guessing it’s some other aspect of the code in or outside the lambdas, and where it moved, that changed the allocation pattern?
My tools don’t tell me which allocations no longer occur, but compiling the lambdas requires memory allocations. I believe the allocations come from the various optimization and analysis passes, SelectionDAG, MC, etc. for each of the MachineFunctions corresponding to the lambdas.
Ah, sorry, I see - compile time performance, not the performance of the generated code. Thanks for the correction/explanation!
Thanks for trying that patch. I agree that 34 mins still isn't good enough but we're heading in the right direction.
Changing the partitioning predicate to the instruction opcode rather than the number of operands in the top-level instruction will hopefully cut it down further. I also have a patch that shaves a small amount off the compile time by replacing the various LLT::scalar()/LLT::vector() calls with references to LLT objects that were created in advance. I tried something similar with getRegBankForRegClass(), but I haven't written that up as a patch yet since it requires some refactoring to get access to a mapping that RegisterBankEmitter.cpp knows about. In my experiment I edited this information into AArch64GenGlobalISel.inc by hand.
I think the real solution is to convert the generated C++ to the state-machine that we intended to end up with. I don't think we'll be able to put it off much longer given that we're hitting compile-time problems when we can only import 25% of the rules. That said, I have a couple more nearly-finished patches I'd like to get in before we introduce the state machine. Hopefully, the above tricks will be enough to save me a re-write.
Could you give https://reviews.llvm.org/differential/diff/100829/ a try? When measuring the compile of AArch64InstructionSelector.cpp.o with asan enabled and running under Instruments’ Allocations profiler, my machine reports that cumulative memory allocation is down to ~3.5GB (was ~10GB), the number of allocations is down to ~4 million (was ~23 million), and the compile time is down to ~15s (was ~60s).
The patch is based on r303542 and the main change is that most of the generated C++ has been replaced with a state-machine based implementation. It’s not fully converted to a state machine yet, since it generates lots of smaller machines (one matcher and one emitter per rule) instead of a single machine, but it’s hopefully sufficient to unblock my patch series.
Great! I expect I’ll be able to cut it down further once I start fusing these smaller state-machines together. Before that, I’ll re-order the patches that went into that diff so that I don’t have to re-commit the regression before fixing it.