You were totally right with your suggestion.
I have implemented code that:
a) does not merge multiple TokenFactor nodes in DAGCombiner::visitTokenFactor() if the resulting TF node would contain more than 64 operands.
b) produces several TokenFactor nodes with at most 64 operands each,
instead of one huge TokenFactor, in SelectionDAGLowering::getRoot().
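To illustrate the check in (a), here is a minimal sketch and not the code from the attached patch. `Node`, `kMaxTFOperands`, `mergedOperandCount` and `shouldMergeTokenFactors` are hypothetical names on a toy data structure, not LLVM's SDNode API, and the sketch only looks one level down, whereas the real combiner has further conditions for merging.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a DAG node; this is not LLVM's SDNode.
struct Node {
    bool isTokenFactor = false;
    std::vector<const Node*> operands;
};

// Operand limit taken from the description above.
constexpr std::size_t kMaxTFOperands = 64;

// Number of operands `tf` would have if each operand that is itself a
// TokenFactor were replaced by its own operands (one level only).
std::size_t mergedOperandCount(const Node& tf) {
    std::size_t count = 0;
    for (const Node* op : tf.operands)
        count += op->isTokenFactor ? op->operands.size() : 1;
    return count;
}

// Merge only when the flattened node stays within the limit.
bool shouldMergeTokenFactors(const Node& tf) {
    assert(tf.isTokenFactor && "expected a TokenFactor node");
    return mergedOperandCount(tf) <= kMaxTFOperands;
}
```

With this guard, a TF whose operands are two TFs of 40 loads each is left alone, because merging would give 80 operands.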
If we have n pending loads, they can be combined into TFs in two ways:
1) The first 64 loads are put into a new node TF1. Then TF1 and the next 64 loads are put into a new node TF2 and so on. That is, each generated TF contains a previous TF as its first operand and a bunch of pending loads:
        TF2
      /  |  \
   TF1 LD2.1 .. LD2.64
2) Every 64 loads are put into a new TF node TFi. Then all such TFs are put into a big parent TF, that has only these TFi nodes as operands:
   TF1   ...   __TFk__
  /   \        /     \
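The two groupings described above can be sketched on a toy graph as follows. This is an illustration under stated assumptions, not the attached patch: `ToyDAG`, `buildChained` and `buildTwoLevel` are hypothetical names, nodes are plain indices, and in the chained variant the previous TF counts against the operand limit, so every TF after the first takes one load fewer than the limit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy graph: a node is an index; a load has no operands, a TF has some.
struct ToyDAG {
    std::vector<std::vector<int>> operands;

    int addLoad() {
        operands.emplace_back();
        return static_cast<int>(operands.size()) - 1;
    }
    int addTokenFactor(std::vector<int> ops) {
        operands.push_back(std::move(ops));
        return static_cast<int>(operands.size()) - 1;
    }
};

// Approach 1: each TF holds the previous TF plus the next batch of loads.
int buildChained(ToyDAG& dag, const std::vector<int>& loads, std::size_t limit) {
    assert(limit >= 2 && !loads.empty());
    int prev = -1;
    std::size_t i = 0;
    while (i < loads.size()) {
        std::vector<int> ops;
        if (prev >= 0) ops.push_back(prev);  // previous TF is the first operand
        std::size_t take = std::min(limit - ops.size(), loads.size() - i);
        auto first = loads.begin() + static_cast<std::ptrdiff_t>(i);
        ops.insert(ops.end(), first, first + static_cast<std::ptrdiff_t>(take));
        i += take;
        prev = dag.addTokenFactor(std::move(ops));
    }
    return prev;  // root of the chain
}

// Approach 2: one TF per batch of loads, then a parent TF over those TFs.
// Note: the parent itself is not split, so it can exceed `limit`
// when there are more than limit * limit loads.
int buildTwoLevel(ToyDAG& dag, const std::vector<int>& loads, std::size_t limit) {
    assert(limit >= 1 && !loads.empty());
    std::vector<int> groups;
    for (std::size_t i = 0; i < loads.size(); i += limit) {
        std::size_t take = std::min(limit, loads.size() - i);
        auto first = loads.begin() + static_cast<std::ptrdiff_t>(i);
        groups.push_back(dag.addTokenFactor(
            std::vector<int>(first, first + static_cast<std::ptrdiff_t>(take))));
    }
    if (groups.size() == 1) return groups.front();
    return dag.addTokenFactor(std::move(groups));
}
```

The chained form yields a tree whose depth grows with the number of loads, while the two-level form keeps the depth at two but makes the parent's width grow instead.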
Changes (a) and (b) significantly reduce the compilation time on my pathological use cases with huge TokenFactors.
I attach the proposed patch to this mail for review.
The only questions I still have are the following:
- Which approach is better, b.1 or b.2?
- If I have a huge number k of loads/nodes (i.e. nodes N1 ... Nk) to be put into a TokenFactor, in which order should they be added? Left-to-right, right-to-left, or something else? I have experimented with approaches b.1 and b.2, and it seems that the order affects the scheduling decisions, sometimes positively and sometimes negatively.
I ask these questions because I do not have a clear understanding of TokenFactors: how they are processed and how they affect scheduling and code generation. Any help is therefore highly appreciated.
TokenFactor.patch (4.89 KB)