FoldingSetNodeID operations inefficiency

Hi Dan,

Thanks for commenting on this topic.
See my comments in-line.

----- Original Message ----

From: Dan Gohman <gohman@apple.com>
To: LLVM Developers Mailing List <llvmdev@cs.uiuc.edu>
Sent: Wednesday, 30 April 2008, 21:38:26
Subject: Re: [LLVMdev] FoldingSetNodeID operations inefficiency

>> Hi Chris,
>>
>> You were totally right with your suggestion.
>>
>> I have implemented code that:
>>
>> a) does not merge multiple TokenFactor nodes in
>> DAGCombiner::visitTokenFactor() if the resulting TF node would
>> contain more than 64 operands.
>>
>> b) produces a bunch of TokenFactor nodes with at most 64 operands,
>> instead of one huge TokenFactor, in
>> SelectionDAGLowering::getRoot().
>>
>> If we have n pending loads, they can be combined into TFs in two
>> ways:
>> 1) The first 64 loads are put into a new node TF1. Then TF1 and
>> the next 64 loads are put into a new node TF2, and so on. That is,
>> each generated TF contains the previous TF as its first operand and a
>> bunch of pending loads:
>>
>>           ______TF2______
>>          /       |       \
>>        TF1     LD2.1 ... LD2.64
>>       /   \
>>   LD1.1 ... LD1.64
>>
>> 2) Every 64 loads are put into a new TF node TFi. Then all such
>> TFs are put into a big parent TF that has only these TFi nodes as
>> operands:
>>
>>           ______TF______
>>          /              \
>>        TF1      ...      TFk
>>       /   \             /   \
>>   LD1.1 ... LD1.64   LDk.1 ... LDk.64
>>
>> These changes (a) and (b) significantly reduce the compilation time
>> on my pathological use cases with huge TokenFactors.
>>
>> I attach the proposed patch to this mail for review.
>>
>> The only questions I still have are the following:
>> - Which approach is better, b.1 or b.2?

> I think b.2 is slightly better than b.1, because b.2 leaves all
> the nodes at the same depth/height in the graph.

Good point.
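
In case it helps the review, the core of the b.2 variant can be
sketched as follows. This is illustrative rather than the patch
verbatim: the helper name getRootB2 is made up, and I use the SDValue
spelling with the getNode() overload that takes an operand array,
which may differ slightly across LLVM revisions.

// Sketch: split PendingLoads into TokenFactors of at most 64
// operands (the TFi nodes above), then join those under a single
// parent TF so every load sits at the same depth in the graph.
// Assumes PendingLoads is non-empty.
static const unsigned MaxTFOperands = 64;

SDValue getRootB2(SelectionDAG &DAG,
                  SmallVectorImpl<SDValue> &PendingLoads) {
  SmallVector<SDValue, 8> ChunkTFs;
  for (unsigned i = 0, e = PendingLoads.size(); i < e;
       i += MaxTFOperands) {
    unsigned N = std::min(MaxTFOperands, e - i);
    ChunkTFs.push_back(DAG.getNode(ISD::TokenFactor, MVT::Other,
                                   &PendingLoads[i], N));
  }
  // A single chunk needs no parent TokenFactor.
  if (ChunkTFs.size() == 1)
    return ChunkTFs[0];
  return DAG.getNode(ISD::TokenFactor, MVT::Other,
                     &ChunkTFs[0], ChunkTFs.size());
}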

> Either approach will restrict scheduling choices, though. With huge
> numbers of loads, register pressure could be a prohibitive problem,
> so it may be desirable to give the list-burr scheduler as much
> flexibility as possible.

OK.

> I think it's worthwhile to investigate alternatives before going
> too far down the path of splitting TokenFactors.

I totally agree and I tried it.

> Do we know where the quadratic FoldingSetNodeID operations are coming from?

Recomputing the FoldingSetNodeID hashes takes a lot of time for a node
with 3200 operands, since the Bits array contains 6400 elements.
Moreover, it looks like a temporary data structure is allocated for
each such computation, which also costs time and memory. I still do
not quite understand where the quadratic factor comes from, though.
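
To make the linear cost concrete: as far as I can tell, profiling an
SDNode adds roughly two words per operand to the ID, along these lines
(AddPointer and AddInteger are the real FoldingSetNodeID methods; the
helper itself is just a sketch of what the operand walk does):

// Sketch: per-operand profiling. For a TokenFactor with 3200
// operands this appends about 6400 words to the ID's Bits vector,
// and it is recomputed from scratch on every CSE lookup.
static void ProfileOperands(FoldingSetNodeID &ID,
                            const SDValue *Ops, unsigned NumOps) {
  for (unsigned i = 0; i != NumOps; ++i) {
    ID.AddPointer(Ops[i].getNode());   // defining node of the operand
    ID.AddInteger(Ops[i].getResNo());  // which result of that node
  }
}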

> Is caching of TokenFactor hash IDs effective?

I inserted some debugging code to see how effective the CSE sets are.
My understanding is that we get hits only for constants, registers,
and some other "simple" nodes. In my tests I have never seen, for
example, a TokenFactor that was found in the CSE table and reused. If
this is the case in general, maybe such nodes should not be put into
the CSE tables at all?

>>
>> - If I have a huge number k of loads/nodes (i.e. nodes N1 ... Nk)
>> to be put into a TokenFactor, what is a better order to put them
>> into a TF? Is it left-to-right, right-to-left, or something else? I
>> have experimented with the approaches b.1 and b.2, and it seems that
>> depending on the order, the scheduling decisions are affected in a
>> positive or negative way.

> If you're using the list-burr scheduler (the default on x86),

I do.

> then it is currently arbitrary. One way might get luckier than the
> other, but it comes down to the NodeQueueId comparison in many cases.

Yes. I still remember this fact, since I added this code to SVN just a few days ago ;)

> In the future I'd like to see the schedulers be able to sort memory
> operations by their addresses. Among other things, it would make them
> less dependent on arbitrary choices like the order of operands in a
> TokenFactor.

Good idea. I was thinking about it as well. Besides introducing a more
deterministic choice, it would also help to achieve better memory
locality, and that may produce dramatic run-time performance wins for
compiled applications. I've observed something similar while working
on spill slot coalescing for my register allocator.

> If you're going to be splitting large TokenFactors, then theoretically
> you'd want to use the same ordering that the schedulers would use,
> such as sorting the loads ascending by addresses or whatever. That
> might not be very simple though :-}.

Well, if the addresses are not constants and have different bases, it may be a bit tricky.
Or do you have something specific in mind?

- Roman


----- Reply ----

From: Dan Gohman <gohman@apple.com>
Subject: Re: [LLVMdev] FoldingSetNodeID operations inefficiency

>> Do we know where the quadratic FoldingSetNodeID operations are coming from?
>
> Recomputing the FoldingSetNodeID hashes takes a lot of time for a node
> with 3200 operands, since the Bits array contains 6400 elements.
> Moreover, it looks like a temporary data structure is allocated for
> each such computation, which also costs time and memory. I still do
> not quite understand where the quadratic factor comes from, though.

Ok.

>> Is caching of TokenFactor hash IDs effective?
>
> I inserted some debugging code to see how effective the CSE sets are.
> My understanding is that we get hits only for constants, registers,
> and some other "simple" nodes. In my tests I have never seen, for
> example, a TokenFactor that was found in the CSE table and reused. If
> this is the case in general, maybe such nodes should not be put into
> the CSE tables at all?

One possible explanation for this is SelectionDAG's conservative approach
to memory dependencies, wherein it often has tall sequences of chains in the
dependency graph and avoids many cases where nodes might have common
chain operands. As LLVM gets more aggressive in this area, we may begin to
see TokenFactors getting CSE'd.

I guess even then it won't be very common though.

>> If you're going to be splitting large TokenFactors, then theoretically
>> you'd want to use the same ordering that the schedulers would use,
>> such as sorting the loads ascending by addresses or whatever. That
>> might not be very simple though :-}.
>
> Well, if the addresses are not constants and have different bases, it
> may be a bit tricky.
>
> Or do you have something specific in mind?

We don't need the base ordering to match the order of the full runtime
addresses; we primarily need to group addresses with common bases
together so that a secondary sort by ascending offset can be effective.
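
For example, something along these lines (the MemOp struct and the
comparison are illustrative stand-ins, not actual scheduler data
structures):

#include <algorithm>
#include <vector>

// Sketch: order memory operations by base pointer first, so that
// operations on the same base become neighbors, then by ascending
// constant offset within each group.
struct MemOp {
  const void *Base;   // e.g. the SDNode* computing the base address
  long long   Offset; // constant offset from that base, if known
};

static bool OrderByBaseThenOffset(const MemOp &A, const MemOp &B) {
  if (A.Base != B.Base)
    return A.Base < B.Base;   // any consistent grouping of bases works
  return A.Offset < B.Offset; // ascending offsets within a group
}

static void SortMemOps(std::vector<MemOp> &Ops) {
  std::sort(Ops.begin(), Ops.end(), OrderByBaseThenOffset);
}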

Dan