FoldingSetNodeID operations inefficiency

Hi Chris,

You were totally right with your suggestion.

I have implemented code that:

a) does not merge multiple TokenFactor nodes in DAGCombiner::visitTokenFactor() if the resulting TF node would contain more than 64 operands.

b) produces several TokenFactor nodes with at most 64 operands each,
instead of one huge TokenFactor, in SelectionDAGLowering::getRoot().

   If we have n pending loads, they can be combined into TFs in two ways
   (a small illustrative sketch of both groupings follows the diagrams):
   1) The first 64 loads are put into a new node TF1. Then TF1 and the next 64 loads are put into a new node TF2, and so on. That is, each generated TF contains the previous TF as its first operand plus a bunch of pending loads:

              ______TF2______
             /     |         \
          TF1    LD2.1  ...  LD2.64
         /    \
     LD1.1 ... LD1.64

   2) Every 64 loads are put into a new TF node TFi. Then all of these TFs are put into one big parent TF that has only the TFi nodes as operands:

              _______TF_______
             /                \
          TF1       ...      __TFk__
         /    \              /      \
     LD1.1 ... LD1.64    LDk.1 ... LDk.64
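
For concreteness, here is a small self-contained C++ sketch of the two groupings. Node and makeTF are toy stand-ins for SDValue and the call that builds an ISD::TokenFactor; this is not the actual SelectionDAG code from the patch, and the 64-operand cap simply mirrors the one described in (a) and (b) above.

// Toy stand-ins -- NOT the SelectionDAG API.
#include <cstddef>
#include <memory>
#include <vector>

struct Node {
  std::vector<const Node *> Ops; // chains this TF depends on
};

static const std::size_t MaxTFOperands = 64; // cap described above

// Stand-in for "create a TokenFactor over these operands".
static const Node *makeTF(std::vector<const Node *> Ops,
                          std::vector<std::unique_ptr<Node>> &Pool) {
  Pool.push_back(std::unique_ptr<Node>(new Node{std::move(Ops)}));
  return Pool.back().get();
}

// b.1: chained groups -- each new TF takes the previous TF as its first
// operand, followed by the next batch of pending loads.
const Node *groupChained(const std::vector<const Node *> &Loads,
                         std::vector<std::unique_ptr<Node>> &Pool) {
  const Node *Prev = nullptr;
  for (std::size_t I = 0; I < Loads.size(); I += MaxTFOperands) {
    std::vector<const Node *> Ops;
    if (Prev)
      Ops.push_back(Prev);
    for (std::size_t J = I; J < Loads.size() && J < I + MaxTFOperands; ++J)
      Ops.push_back(Loads[J]);
    Prev = makeTF(std::move(Ops), Pool);
  }
  return Prev; // null if there were no pending loads
}

// b.2: one TF per batch of loads, then a single flat parent TF whose
// operands are only those per-batch TFs.
const Node *groupFlat(const std::vector<const Node *> &Loads,
                      std::vector<std::unique_ptr<Node>> &Pool) {
  std::vector<const Node *> Groups;
  for (std::size_t I = 0; I < Loads.size(); I += MaxTFOperands) {
    std::vector<const Node *> Ops;
    for (std::size_t J = I; J < Loads.size() && J < I + MaxTFOperands; ++J)
      Ops.push_back(Loads[J]);
    Groups.push_back(makeTF(std::move(Ops), Pool));
  }
  if (Groups.empty())
    return nullptr;
  if (Groups.size() == 1)
    return Groups.front();
  return makeTF(std::move(Groups), Pool);
}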

These changes (a) and (b) significantly reduce the compilation time on my pathological use cases with huge TokenFactors.

I attach the proposed patch to this mail for review.

The only questions I still have are the following:
- Which approach is better, b.1 or b.2?
- If I have a huge number k of loads/nodes (i.e. nodes N1 ... Nk) to be put into a TokenFactor, what is the better order to put them in: left-to-right, right-to-left, or something else? I have experimented with approaches b.1 and b.2, and depending on the order, the scheduling decisions are affected positively or negatively.

I am asking these questions because I do not have a clear understanding of TokenFactors, how they are processed, and how they affect scheduling and code generation, so any help is highly appreciated.

Thanks,
   Roman


TokenFactor.patch (4.89 KB)

Dan replied:

> The only questions I still have are the following:
> - Which approach is better, b.1 or b.2?

I think b.2 is slightly better than b.1, because b.2 leaves all
the nodes at the same depth/height in the graph. Either approach
will restrict scheduling choices though. With huge numbers of
loads, register pressure could be a prohibitive problem, so it
may be desirable to give the list-burr scheduler as much
flexibility as possible.

I think it's worthwhile to investigate alternatives before going
too far down the path of splitting TokenFactors. Do we know
where the quadratic FoldingSetNodeID operations are coming from?
Is caching of TokenFactor hash IDs effective?
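
As a rough way to picture where that cost could come from (a sketch only; it uses the real FoldingSetNodeID hashing calls but is not the actual node-profiling code in SelectionDAG): every CSE lookup of a node hashes all of its operands, so one lookup on an N-operand TokenFactor is O(N), and repeatedly re-profiling a TokenFactor that keeps growing across combine steps adds up to roughly O(N^2) work.

// Illustrative only -- not the actual SelectionDAG profiling code.
#include "llvm/ADT/FoldingSet.h"
#include <vector>

unsigned hashTokenFactorOperands(const std::vector<const void *> &Operands) {
  llvm::FoldingSetNodeID ID;
  for (const void *Op : Operands) // one hash update per operand
    ID.AddPointer(Op);
  return ID.ComputeHash();
}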

> - If I have a huge number k of loads/nodes (i.e. nodes N1 ... Nk) to be put into a TokenFactor, what is the better order to put them in: left-to-right, right-to-left, or something else? I have experimented with approaches b.1 and b.2, and depending on the order, the scheduling decisions are affected positively or negatively.

If you're using the list-burr scheduler (the default on x86), then it is
currently arbitrary. One ordering might get luckier than the other, but
it comes down to the NodeQueueId comparison in many cases.

In the future I'd like to see the schedulers be able to sort memory
operations by their addresses. Among other things, it would make them
less dependent on arbitrary choices like the order of operands in a
TokenFactor.

If you're going to be splitting large TokenFactors, then theoretically
you'd want to use the same ordering that the schedulers would use,
such as sorting the loads ascending by address or whatever. That
might not be very simple though :-}.
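
A minimal sketch of that idea follows; LoadInfo and its fields are hypothetical stand-ins for whatever address key is recoverable from a LoadSDNode, and handling loads with non-constant or incomparable addresses is exactly the part that might not be simple.

// Sort pending loads by address before grouping them into TokenFactors.
// Illustrative only -- not existing scheduler or lowering code.
#include <algorithm>
#include <cstdint>
#include <vector>

struct LoadInfo {
  std::uint64_t Address; // stand-in for a comparable address key
  unsigned OrigIndex;    // position in the original pending-loads list
};

void sortLoadsForGrouping(std::vector<LoadInfo> &Loads) {
  // Ascending by address; stable_sort keeps the original relative order
  // for loads whose addresses compare equal (or are unknown).
  std::stable_sort(Loads.begin(), Loads.end(),
                   [](const LoadInfo &A, const LoadInfo &B) {
                     return A.Address < B.Address;
                   });
}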

Dan