I have made an attempt at implementing an optimal variant of the register bank selector (regbankselect). The code is available for review at https://reviews.llvm.org/D90304, and I would greatly appreciate it if anyone interested could provide their comments. I have run the regbankselect-*.mir test cases for AArch64 and it seems to work, but more tests are surely needed to increase confidence in the implementation. I also tried using AMDGPU, but that backend does not provide the full list of InstructionMappings for a given MachineInstr, which is needed in order to compute the optimal selection of register banks.
Gabriel Hjort Åkerlund
Thank you so much for doing this! I'll try out the patch in our downstream implementation right away. Do you know how big of an impact this has on compile time compared to fast and greedy?
(Sorry for spamming; the first reply only went to you and not the list.)
Cool that you want to try it out! There will for sure be some bugs in it, so please let me know if/when you find one and I’ll fix it. And if you could make a testcase out of it, that would be superb (as there’s currently a complete lack of tests).
Although I haven't measured it, I expect compilation to take about 3x longer compared to greedy, as it makes three passes over the instructions. However, most of the work is done in the first pass, which is comparable to the pass made in greedy. So hopefully it's less than 3x, but this should really be measured over a set of functions to get an accurate figure. Also, there will most likely be improvements to be had to decrease compilation time.
It's cool that you're trying to do this, but I think you need more
proof before calling this optimal.
Put briefly, dynamic programming approaches for this kind of
mapping/matching can be optimal if the underlying dependency structure
is a tree. If it's a more general DAG, optimality goes out of the
window. In your dynamic programming approach, my understanding is that
you remember the cost of every possible "realization"/"mapping" of
every node individually. So, you end up computing N*M pieces of
information, where N is the number of nodes and M is the number of
options to choose from at each node.
Unfortunately, the final selection of choices only works in a tree,
because you basically have to choose each node's
"realization"/"mapping" based on a single successor.
As soon as you're in a general DAG, where a node has to satisfy
*multiple* successors, that no longer works. You could theoretically
extend the dynamic programming approach, but only by computing
information about the cost of *correlated* choices for multiple nodes
simultaneously. But then, your performance can go out of the window
because you need to compute something like N*M^K pieces of
information, where K is a bound on the number of nodes whose choices
you need to consider simultaneously. K cannot be bounded in general
(other than by some trivial function of the number of nodes in the
graph), which means you end up with an exponential worst case if you
want to solve this optimally. I would actually expect optimal register
bank selection on a general DAG to be NP-complete, but I haven't
thought about it too deeply.
Thanks for your response!
Fair enough, I will rename it to "global", as that is a property we can all agree on. And if it happens to compute the optimal selection, then that's just a bonus. =)