Lowering switch statements with PGO

All,
About two months ago I posted a patch that hoisted the hottest case
statement from a switch statement during ISelLowering.

See: http://reviews.llvm.org/D5786

Sean was rather adamant about using a Huffman tree (and I agree this is a
more complete solution), so I'd like to put a patch together. That being
said, I have a few questions.

The current switch lowering code sorts based on case values and is lowered
with a combination of binary trees, lookup tables, bit tests and magic.
If we lower the switch based on branch probabilities, I think the most
reasonable approach would be to disable the lookup tables and stick with a
purely tree structure. Does anyone object or have opinions on this
matter?

Also, is there a way to determine if branch probabilities are based on
real data or static heuristics? I assume we'd only want to build a
Huffman tree if we have real data. Please correct me if I'm wrong.

Comments are very welcome!

Chad
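
For concreteness, a minimal sketch of the Huffman-style construction being discussed, in plain C++. This is illustrative only, not the D5786 patch: `CaseNode`, `buildHuffmanTree`, and the weight representation are assumptions. Repeatedly merging the two lightest subtrees pushes colder cases behind more comparisons and keeps the hottest cases near the root.

    // Build a Huffman-style tree over switch cases: leaves are the cases,
    // internal nodes carry subtree weight, and hotter cases end up behind
    // fewer compare-and-branch steps. Illustrative sketch, not LLVM code.
    #include <cstdint>
    #include <memory>
    #include <queue>
    #include <utility>
    #include <vector>

    struct CaseNode {
      int64_t CaseValue = 0;                  // meaningful for leaves only
      uint64_t Weight = 0;                    // profile weight (or subtree sum)
      std::unique_ptr<CaseNode> Left, Right;  // set for internal nodes only
    };

    static std::unique_ptr<CaseNode>
    buildHuffmanTree(const std::vector<std::pair<int64_t, uint64_t>> &Cases) {
      auto Heavier = [](const CaseNode *A, const CaseNode *B) {
        return A->Weight > B->Weight;         // min-heap: lightest on top
      };
      std::priority_queue<CaseNode *, std::vector<CaseNode *>, decltype(Heavier)>
          Q(Heavier);
      for (const auto &C : Cases) {
        auto *Leaf = new CaseNode();
        Leaf->CaseValue = C.first;
        Leaf->Weight = C.second;
        Q.push(Leaf);
      }
      while (Q.size() > 1) {                  // merge the two lightest subtrees
        CaseNode *A = Q.top(); Q.pop();
        CaseNode *B = Q.top(); Q.pop();
        auto *Parent = new CaseNode();
        Parent->Weight = A->Weight + B->Weight;
        Parent->Left.reset(A);
        Parent->Right.reset(B);
        Q.push(Parent);
      }
      return std::unique_ptr<CaseNode>(Q.empty() ? nullptr : Q.top());
    }

Walking the resulting tree and emitting a comparison at each internal node would give the probability-weighted comparison tree the thread is discussing; how each internal node should turn into an equality test, range check, or table is exactly the question debated below.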

If one case strongly dominates the others, and there are still a large number of unlikely cases, it seems like it would be a good idea to do a branch for the hot case, and then a lookup table for the cold cases. More generally, it makes intuitive sense to me that this kind of switch-peeling is naturally recursive, and we should allow the existing heuristics to fire each time we peel out cases.

—Owen
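
As an illustration of the switch-peeling Owen describes, here is the shape the lowered code could take if written by hand. This is a hedged sketch: the constants, the handlers, and the assumption that one case dominates are all made up for the example.

    // Peel the dominant case behind a single well-predicted branch, then let
    // the usual heuristics (here: a dense switch, i.e. a jump table) handle
    // the cold remainder. Names and values are hypothetical.
    enum { kHot = 7, kCold0 = 0, kCold1 = 1, kCold2 = 2, kCold3 = 3 };

    static int handleHot()       { return 1; }
    static int handleCold(int X) { return X; }
    static int handleDefault()   { return -1; }

    int dispatch(int X) {
      if (X == kHot)        // hottest case peeled off first
        return handleHot();
      switch (X) {          // cold cases: dense enough for a lookup table
      case kCold0:
      case kCold1:
      case kCold2:
      case kCold3:
        return handleCold(X);
      default:
        return handleDefault();
      }
    }

The same peel-then-reconsider step can be applied to whatever remains, which is the recursion Owen alludes to.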

All,
About two months ago I posted a patch that hoisted the hottest case
statement from a switch statement during ISelLowering.

See: http://reviews.llvm.org/D5786 ([PGO] Hoist hot case statement from switches)

Sean was rather adamant about using a Huffman tree (and I agree this is a
more complete solution), so I'd like to put a patch together.

I think this sounds really cool!

That being
said, I have a few questions.

The current switch lowering code sorts based on case values and is lowered
with a combination of binary trees, lookup tables, bit tests and magic.
If we lower the switch based on branch probabilities, I think the most
reasonable approach would be to disable the lookup tables and stick with a
purely tree structure. Does anyone object or have opinions on this
matter?

Spontaneously I would have thought the Huffman tree would just replace
the binary trees, i.e. we'd use the branch probabilities to build the
tree differently.

For how many hot cases is a Huffman tree faster than a jump table? I
suspect we'll still want to build jump tables when we can.

Thanks,
Hans

All,
About two months ago I posted a patch that hoisted the hottest case
statement from a switch statement during ISelLowering.

See: http://reviews.llvm.org/D5786 ([PGO] Hoist hot case statement from switches)

Sean was rather adamant about using a Huffman tree (and I agree this is a
more complete solution), so I'd like to put a patch together.

I think this sounds really cool!

That being
said, I have a few questions.

The current switch lowering code sorts based on case values and is lowered
with a combination of binary trees, lookup tables, bit tests and magic.
If we lower the switch based on branch probabilities, I think the most
reasonable approach would be to disable the lookup tables and stick with a
purely tree structure. Does anyone object or have opinions on this
matter?

Spontaneously I would have thought the Huffman tree would just replace
the binary trees, i.e. we'd use the branch probabilities to build the
tree differently.

The current implementation selects the pivot in such a way that it optimizes
for lookup tables. Lowering the binary tree based on branch probabilities
doesn't fit well with this approach. However, I think we could start by
building a Huffman tree and once we've handled the most heavily weighted
nodes, we could fall back to the current logic. Make sense?

For how many hot cases is a Huffman tree faster than a jump table? I
suspect we'll still want to build jump tables when we can.

I take back my previous comment; I agree we should have a hybrid approach,
per my comments above.

> All,
> About two months ago I posted a patch that hoisted the hottest case
> statement from a switch statement during ISelLowering.
>
> See: http://reviews.llvm.org/D5786 ([PGO] Hoist hot case statement from switches)
>
> Sean was rather adamant about using a Huffman tree (and I agree this is a
> more complete solution), so I'd like to put a patch together.

I think this sounds really cool!

> That being
> said, I have a few questions.
>
> The current switch lowering code sorts based on case values and is lowered
> with a combination of binary trees, lookup tables, bit tests and magic.
> If we lower the switch based on branch probabilities, I think the most
> reasonable approach would be to disable the lookup tables and stick with a
> purely tree structure. Does anyone object or have opinions on this
> matter?

Spontaneously I would have thought the Huffman tree would just replace
the binary trees, i.e. we'd use the branch probabilities to build the
tree differently.

For how many hot cases is a Huffman tree faster than a jump table? I
suspect we'll still want to build jump tables when we can.

I don't have any data, but I suspect that the relative performance of the
two is highly situation-specific and architecture-specific.

One thing to remember is that a Huffman tree is an *optimal* solution to
the problem that it solves (it is not a heuristic); however, the problem of
switch lowering is not exactly the problem that a Huffman tree solves. So
really we are talking about "how do we model the costs?" (and later "how do
we optimize over the solution space?"), which is a tedious problem.

One place to start might be to determine how much performance a naive
Huffman-based approach leaves on the table (and also *why* it suffers in
particular situations). Maybe just a few special cases on top of a Huffman
tree would give most of the benefit, while still largely retaining the
underlying conceptual simplicity of the Huffman approach (and the model
that it is optimal for).

-- Sean Silva

I think it would be interesting to find out where the best cut-off
between jump table and Huffman tree is. If there's a single hot case,
then checking for that first and then doing a jump table is probably
beneficial. But how many branches does it take before just going
through the jump table is faster? Like Sean says, this probably varies
a lot between architectures.

I suspect that if the whole switch can be lowered to a jump table,
that's the best lowering. Using a Huffman tree sounds like a great
solution when a single jump table doesn't work.

- Hans

A well predicted branch can be better than even a well predicted indirect
branch in lots of cases. At the IR level it allows predicate based
simplification that may be harder. At the machine level, there are
different levels of prediction accuracy, and I think the branch is
essentially always likely to do better in terms of icache.

>> All,
>> About two months ago I posted a patch that hoisted the hottest case
>> statement from a switch statement during ISelLowering.
>>
>> See: http://reviews.llvm.org/D5786 ([PGO] Hoist hot case statement from switches)
>>
>> Sean was rather adamant about using a Huffman tree (and I agree this is a
>> more complete solution), so I'd like to put a patch together.
>
> I think this sounds really cool!
>
>> That being
>> said, I have a few questions.
>>
>> The current switch lowering code sorts based on case values and is lowered
>> with a combination of binary trees, lookup tables, bit tests and magic.
>> If we lower the switch based on branch probabilities, I think the most
>> reasonable approach would be to disable the lookup tables and stick with a
>> purely tree structure. Does anyone object or have opinions on this
>> matter?
>
> Spontaneously I would have thought the Huffman tree would just replace
> the binary trees, i.e. we'd use the branch probabilities to build the
> tree differently.

The current implementation selects the pivot in such a way that it optimizes
for lookup tables.

This seems like a potential point of friction: for lookup tables (and other
techniques) you want to inherently keep the cases sorted by their values,
but a Huffman tree does not care about the actual values; it only cares
about their relative probabilities.

-- Sean Silva

All,
About two months ago I posted a patch that hoisted the hottest case
statement from a switch statement during ISelLowering.

See: http://reviews.llvm.org/D5786 ([PGO] Hoist hot case statement from switches)

Sean was rather adamant about using a Huffman tree (and I agree this is a
more complete solution), so I'd like to put a patch together. That being
said, I have a few questions.

The current switch lowering code sorts based on case values and is lowered
with a combination of binary trees, lookup tables, bit tests and magic.
If we lower the switch based on branch probabilities, I think the most
reasonable approach would be to disable the lookup tables and stick with a
purely tree structure. Does anyone object or have opinions on this
matter?

Also, is there a way to determine if branch probabilities are based on
real data or static heuristics? I assume we'd only want to build a
Huffman tree if we have real data. Please correct me if I'm wrong.

That sounds like a mistake to me, and there isn’t a way to do it now regardless. If you’re getting block frequencies via heuristics, I would expect most of the cases to be weighted equally. I don’t remember any static heuristics for switches.

>> All,
>> About two months ago I posted a patch that hoisted the hottest case
>> statement from a switch statement during ISelLowering.
>>
>> See: http://reviews.llvm.org/D5786 ([PGO] Hoist hot case statement from switches)
>>
>> Sean was rather adamant about using a Huffman tree (and I agree this is a
>> more complete solution), so I'd like to put a patch together.
>
> I think this sounds really cool!
>
>> That being
>> said, I have a few questions.
>>
>> The current switch lowering code sorts based on case values and is lowered
>> with a combination of binary trees, lookup tables, bit tests and magic.
>> If we lower the switch based on branch probabilities, I think the most
>> reasonable approach would be to disable the lookup tables and stick with a
>> purely tree structure. Does anyone object or have opinions on this
>> matter?
>
> Spontaneously I would have thought the Huffman tree would just replace
> the binary trees, i.e. we'd use the branch probabilities to build the
> tree differently.

The current implementation selects the pivot in such a way that it optimizes
for lookup tables.

This seems like a potential point of friction: for lookup tables (and other
techniques) you want to inherently keep the cases sorted by their values,
but a Huffman tree does not care about the actual values; it only cares
about their relative probabilities.

Exactly! I can think of lots of ways to deal with this "friction", but
none of them sit well. As you mentioned, building a Huffman tree and then
dealing with corner cases would be one approach. Another, which was
Owen's suggestion, would be to peel the hot cases and then fall back to
the current logic.

>>
>> >> All,
>> >> About two months ago I posted a patch that hoisted the hottest case
>> >> statement from a switch statement during ISelLowering.
>> >>
>> >> See: http://reviews.llvm.org/D5786 ([PGO] Hoist hot case statement from switches)
>> >>
>> >> Sean was rather adamant about using a Huffman tree (and I agree this is a
>> >> more complete solution), so I'd like to put a patch together.
>> >
>> > I think this sounds really cool!
>> >
>> >> That being
>> >> said, I have a few questions.
>> >>
>> >> The current switch lowering code sorts based on case values and is lowered
>> >> with a combination of binary trees, lookup tables, bit tests and magic.
>> >> If we lower the switch based on branch probabilities, I think the most
>> >> reasonable approach would be to disable the lookup tables and stick with a
>> >> purely tree structure. Does anyone object or have opinions on this
>> >> matter?
>> >
>> > Spontaneously I would have thought the Huffman tree would just replace
>> > the binary trees, i.e. we'd use the branch probabilities to build the
>> > tree differently.
>>
>> The current implementation selects the pivot in such a way that it optimizes
>> for lookup tables.
>
>
> This seems like a potential point of friction: for lookup tables (and other
> techniques) you want to inherently keep the cases sorted by their values,
> but a Huffman tree does not care about the actual values; it only cares
> about their relative probabilities.

Exactly! I can think of lots of ways to deal with this "friction", but
none of them sit well. As you mentioned, building a Huffman tree and then
dealing with corner cases would be one approach. Another, which was
Owen's suggestion, would be to peel the hot cases and then fall back to
the current logic.

Another possibility might be to use a hybrid approach.

For example, while building the Huffman tree, propagate various information
up the tree as you build it. Then you can very easily start at the top of
the Huffman tree and have all this information available to you (e.g. "peel
off the most probable cases until the relative probability has some relation
to the subtree case density").

On the other hand (more likely), you could keep the cases ordered (i.e. not
a Huffman tree) and integrate the branch probability info as the extra data
propagated up the tree (along with other stuff); e.g. the max probability
of any node in the subtree, the total probability sum in the subtree. This
would also allow some highly non-trivial properties, like "maximum number
of consecutive cases observed in each subtree" to be maintained, along with
O(1) access to the root of the subtree containing the consecutive cases
(and if we keep the tree balanced, O(log n) time to extract the consecutive
cases and update all data we are propagating up the tree). This would allow
a really clean lowering algorithm like:

while (we have subtrees with property X)
  extract and handle subtree in a particular way
while (we have subtrees with property Y)
  extract and handle subtree in a different way
etc.

A prototype of this can be done really easily with a non-balanced binary
tree. In order to make it reliably fast the binary tree will need to be
upgraded to a balanced binary tree (red-black, AVL, treap, etc.), which I
will go ahead and volunteer to do (I've actually spent more time than I
care to admit investigating balanced binary trees and their implementation).

For a textbook reference on maintaining extra data in a tree, see e.g.
section 14.2 of CLRS.

-- Sean Silva
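
To make the augmented-tree idea above concrete, here is a hedged sketch of a value-ordered node with data pulled up from its children, in the CLRS 14.2 style Sean references. The particular summaries and the density threshold are illustrative assumptions, not a concrete proposal.

    #include <algorithm>
    #include <cstdint>
    #include <initializer_list>

    struct CaseTreeNode {
      int64_t CaseValue = 0;              // key: cases stay sorted by value
      uint64_t Weight = 0;                // this case's own profile weight
      CaseTreeNode *Left = nullptr;
      CaseTreeNode *Right = nullptr;

      // Summaries propagated up the tree (recomputed when children change):
      uint64_t SubtreeWeight = 0;         // total weight in this subtree
      uint64_t MaxWeight = 0;             // hottest single case below here
      int64_t MinValue = 0, MaxValue = 0; // value range, for density checks
      unsigned NumCases = 0;              // number of cases in this subtree
    };

    // Recompute N's summaries from its children; called bottom-up along any
    // path that changed, which keeps updates O(log n) in a balanced tree.
    static void pullUp(CaseTreeNode *N) {
      N->SubtreeWeight = N->Weight;
      N->MaxWeight = N->Weight;
      N->MinValue = N->MaxValue = N->CaseValue;
      N->NumCases = 1;
      for (CaseTreeNode *C : {N->Left, N->Right}) {
        if (!C)
          continue;
        N->SubtreeWeight += C->SubtreeWeight;
        N->MaxWeight = std::max(N->MaxWeight, C->MaxWeight);
        N->MinValue = std::min(N->MinValue, C->MinValue);
        N->MaxValue = std::max(N->MaxValue, C->MaxValue);
        N->NumCases += C->NumCases;
      }
    }

    // Example query a lowering loop might ask in O(1) at any subtree root:
    // "is this subtree dense enough to be a jump table?"
    static bool denseEnoughForTable(const CaseTreeNode *N) {
      uint64_t Range = uint64_t(N->MaxValue - N->MinValue) + 1;
      return uint64_t(N->NumCases) * 100 >= Range * 40; // made-up 40% threshold
    }

With summaries like these available at every subtree root, the "while (we have subtrees with property X)" loops above reduce to cheap queries plus an extraction that only touches O(log n) nodes.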

>>
>> >> All,
>> >> About two months ago I posted a patch that hoisted the hottest case
>> >> statement from a switch statement during ISelLowering.
>> >>
>> >> See: http://reviews.llvm.org/D5786 ([PGO] Hoist hot case statement from switches)
>> >>
>> >> Sean was rather adamant about using a Huffman tree (and I agree this is a
>> >> more complete solution), so I'd like to put a patch together.
>> >
>> > I think this sounds really cool!
>> >
>> >> That being
>> >> said, I have a few questions.
>> >>
>> >> The current switch lowering code sorts based on case values and is lowered
>> >> with a combination of binary trees, lookup tables, bit tests and magic.
>> >> If we lower the switch based on branch probabilities, I think the most
>> >> reasonable approach would be to disable the lookup tables and stick with a
>> >> purely tree structure. Does anyone object or have opinions on this
>> >> matter?
>> >
>> > Spontaneously I would have thought the Huffman tree would just replace
>> > the binary trees, i.e. we'd use the branch probabilities to build the
>> > tree differently.
>>
>> The current implementation selects the pivot in such a way that it optimizes
>> for lookup tables.
>
>
> This seems like a potential point of friction: for lookup tables (and other
> techniques) you want to inherently keep the cases sorted by their values,
> but a Huffman tree does not care about the actual values; it only cares
> about their relative probabilities.

Exactly! I can think of lots of ways to deal with this "friction", but
none of them sit well. As you mentioned, building a Huffman tree and then
dealing with corner cases would be one approach. Another, which was
Owen's suggestion, would be to peel the hot cases and then fall back to
the current logic.

Another possibility might be to use a hybrid approach.

For example, while building the Huffman tree, propagate various
information up the tree as you build it. Then you can very easily start at
the top of the Huffman tree and have all this information available to you
(e.g. "peel off the most probable cases until the relative probability has
some relation to the subtree case density").

On the other hand (more likely), you could keep the cases ordered (i.e.
not a Huffman tree) and integrate the branch probability info as the extra
data propagated up the tree (along with other stuff); e.g. the max
probability of any node in the subtree, the total probability sum in the
subtree. This would also allow some highly non-trivial properties, like
"maximum number of consecutive cases observed in each subtree" to be
maintained, along with O(1) access to the root of the subtree containing
the consecutive cases (and if we keep the tree balanced, O(log n) time to
extract the consecutive cases and update all data we are propagating up the
tree). This would allow a really clean lowering algorithm like:

while (we have subtrees with property X)
  extract and handle subtree in a particular way
while (we have subtrees with property Y)
  extract and handle subtree in a different way
etc.

A prototype of this can be done really easily with a non-balanced binary
tree. In order to make it reliably fast the binary tree will need to be
upgraded to a balanced binary tree (red-black, AVL, treap, etc.), which I
will go ahead and volunteer to do (I've actually spent more time than I
care to admit investigating balanced binary trees and their implementation).

Actually, since we really only would be mutating the tree in a very
controlled way, we could probably avoid actually ever mutating the
structure of the tree and hence avoid a traditional `struct node *left,
*right;`. Instead, we could use an implicit tree structure like in a binary
heap, which would be naturally balanced and more compact in memory as well.

-- Sean Silva
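
A small sketch of the implicit layout mentioned above (field names and the single propagated summary are assumptions): nodes sit in one flat array, children are found by index arithmetic as in a binary heap, and a single backwards pass recomputes the propagated data without any pointer chasing.

    #include <cstddef>
    #include <vector>

    struct ImplicitNode {
      long CaseValue = 0;
      unsigned long Weight = 0;        // this case's own profile weight
      unsigned long SubtreeWeight = 0; // propagated: total weight below here
    };

    // Children of node I live at 2*I+1 and 2*I+2, so the tree is balanced by
    // construction and needs no left/right pointers at all.
    static void recomputeSubtreeWeights(std::vector<ImplicitNode> &Tree) {
      for (size_t I = Tree.size(); I-- > 0;) { // visit children before parents
        Tree[I].SubtreeWeight = Tree[I].Weight;
        size_t L = 2 * I + 1, R = 2 * I + 2;
        if (L < Tree.size())
          Tree[I].SubtreeWeight += Tree[L].SubtreeWeight;
        if (R < Tree.size())
          Tree[I].SubtreeWeight += Tree[R].SubtreeWeight;
      }
    }

Because the lowering would only ever consume subtrees rather than rebalance them, as described above, the shape never changes and the flat layout stays valid throughout.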

(this is a total side-note)

>
> All,
> About two months ago I posted a patch that hoisted the hottest case
> statement from a switch statement during ISelLowering.
>
> See: http://reviews.llvm.org/D5786 ([PGO] Hoist hot case statement from switches)
>
> Sean was rather adamant about using a Huffman tree (and I agree this is a
> more complete solution), so I'd like to put a patch together. That being
> said, I have a few questions.
>
> The current switch lowering code sorts based on case values and is lowered
> with a combination of binary trees, lookup tables, bit tests and magic.
> If we lower the switch based on branch probabilities, I think the most
> reasonable approach would be to disable the lookup tables and stick with a
> purely tree structure. Does anyone object or have opinions on this
> matter?
>
> Also, is there a way to determine if branch probabilities are based on
> real data or static heuristics? I assume we'd only want to build a
> Huffman tree if we have real data. Please correct me if I'm wrong.

That sounds like a mistake to me, and there isn’t a way to do it now
regardless. If you’re getting block frequencies via heuristics, I would
expect most of the cases to be weighted equally. I don’t remember any
static heuristics for switches.

I would expect that any good algorithm for PGO lowering would be good for
non-PGO if you set all probabilities to 0. We technically aren't really
interested in probabilities, but something more like "amount of observed
occurrence" (the term "probability" implies that the union of all the
probabilities must exhaust the entire space of possibilities; obviously not
true if every case is 0 "probability"). The input to the PGO lowering
is more generally just a set of observations or a sample of observations,
so it must make sense to have all cases with 0 "amount of observed
occurrence".

E.g. you may be looking at pc samples and looking at how often your PC
sample lands in the text of a particular case. When we do PGO lowering, we
do "have data" on that switch, even if we collect no samples in it (in the
same sense that an empty vector "contains" 0 elements). However, we could
just as easily have perhaps gotten one sample in the switch by luck. It
seems very prone to issues to have a discontinuity in our lowering strategy
(i.e. in PGO codegen, treating "no samples" as "doesn't have PGO data and
so needs a different lowering strategy") based on a continuous factor like
the statistical sense in which we might pick up samples for a particular
switch.

-- Sean Silva

(Still off-topic...)

I'm not convinced. I imagine there's enough code like this:

    int state = ...;
    switch (state) {
    default:
      return make_error("unexpected state", state);
    case StateStart:
      // ...
      break;
    case StateA:
      // ...
      break;
    case StateB:
      // ...
      break;
    case StateC:
      // ...
      break;
    }

that the heuristic could do a fair bit of harm.

I'm not sure you can do much good without a profile.

After messing around with this a bit recently, I’ve come to the following conclusions:

  1. One issue is at what granularity to consider the PGO weightings. For example, should a set of 50 consecutive cases with 65% of the total weight distributed approximately evenly and which can be lowered to a table lookup be handled before the 3 remaining cases that have 5%, 10%, and 20% probability, respectively?

  2. Rather than saying “hot case” to mean the case with the most weight, we might want to talk about a different “unit” in which to aggregate the weights (such as in 1., considering the whole lookup table as one; a rough sketch of this aggregation follows the list).

  3. The primary advantages of a tree approach I suggested are likely to surround being able to have a more “global” picture of the set of cases at each step in the lowering and do all the computation up front.
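
A hedged sketch of the aggregation in points 1 and 2 (the types and the gap threshold are made up): group value-consecutive cases into one candidate table "unit" and rank that unit by the sum of its members' weights rather than by any single case.

    #include <cstdint>
    #include <vector>

    struct Case { int64_t Value; uint64_t Weight; };

    struct Cluster {
      std::vector<Case> Members; // consecutive-ish cases: one table candidate
      uint64_t TotalWeight = 0;  // aggregate "hotness" used to rank this unit
    };

    // Greedily split value-sorted cases wherever the gap to the next value
    // would make a table too sparse, summing weights per cluster as we go.
    static std::vector<Cluster>
    clusterCases(const std::vector<Case> &SortedByValue) {
      std::vector<Cluster> Out;
      for (const Case &C : SortedByValue) {
        bool StartNew =
            Out.empty() || C.Value - Out.back().Members.back().Value > 4;
        if (StartNew)
          Out.emplace_back();
        Out.back().Members.push_back(C);
        Out.back().TotalWeight += C.Weight;
      }
      return Out;
    }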

In a larger scope, I’m convinced that in order to make appropriate tradeoffs, we probably need to analyze lots of real-world code. For example:

  • some use cases of switch are extremely performance-critical, like lexers/interpreters, while at the other end of the spectrum things like enum->string conversions for printing are less so (and should probably be optimized for size).
  • some uses of switches do very different things in each of their case destinations (and use fallthrough, which must be exploited properly by the lowering), while others have very similar code in their destinations (the entire switch might actually be describing a mapping on data, rather than a control flow construct per se; e.g. a table lookup (dense indices) or binary search in a static table (sparse indices) might be a great lowering).
  • some switches have multiple case values that map to each destination while others have a 1-1 mapping.

Those are some ways that switches can vary off the top of my head. I think it would be really valuable if somebody who has easy access to compiling a large amount of real-world code with a custom compiler could collect some data (my understanding is that Google’s internal codebase is a good candidate).

– Sean Silva

After messing around with this a bit recently, I've come to the following
conclusions:

1. One issue is at what granularity to consider the PGO weightings. For
example, should a set of 50 consecutive cases with 65% of the total weight
distributed approximately evenly and which can be lowered to a table lookup
be handled before the 3 remaining cases that have 5%, 10%, and 20%
probability, respectively?

It seems reasonable to consider the lookup table as the 'hottest' case
(per your next comment).

2. Rather than saying "hot case" to mean the case with the most weight, we
might want to talk about a different "unit" in which to aggregate the
weights (such as in 1., considering the whole lookup table as one).

I agree with this point.

3. The primary advantages of a tree approach I suggested are likely to
surround being able to have a more "global" picture of the set of cases at
each step in the lowering and do all the computation up front.

In a larger scope, I'm convinced that in order to make appropriate
tradeoffs, we probably need to analyze lots of real-world code.

Agreed.

For example:

- some use cases of `switch` are extremely performance-critical, like
lexers/interpreters, while at the other end of the spectrum things like
enum->string conversions for printing are less so (and should probably be
optimized for size).
- some uses of switches do very different things in each of their case
destinations (and use fallthrough, which must be exploited properly by the
lowering), while others have very similar code in their destinations (the
entire switch might actually be describing a mapping on data, rather than a
control flow construct per se; e.g. a table lookup (dense indices) or
binary search in a static table (sparse indices) might be a great lowering).
- some switches have multiple case values that map to each destination
while others have a 1-1 mapping.

Those are some ways that switches can vary off the top of my head. I think
it would be really valuable if somebody who has easy access to compiling a
large amount of real-world code with a custom compiler could collect some
data (my understanding is that Google's internal codebase is a good
candidate).

(Sadly) My world consists of SPEC2K and SPEC2K6, so I don't have much to
offer here. :confused:

> After messing around with this a bit recently, I've come to the following
> conclusions:
>
> 1. One issue is at what granularity to consider the PGO weightings. For
> example, should a set of 50 consecutive cases with 65% of the total weight
> distributed approximately evenly and which can be lowered to a table lookup
> be handled before the 3 remaining cases that have 5%, 10%, and 20%
> probability, respectively?

It seems reasonable to consider the lookup table as the 'hottest' case
(per your next comment).

> 2. Rather than saying "hot case" to mean the case with the most weight, we
> might want to talk about a different "unit" in which to aggregate the
> weights (such as in 1., considering the whole lookup table as one).

I agree with this point.

> 3. The primary advantages of a tree approach I suggested are likely to
> surround being able to have a more "global" picture of the set of cases at
> each step in the lowering and do all the computation up front.
>
> In a larger scope, I'm convinced that in order to make appropriate
> tradeoffs, we probably need to analyze lots of real-world code.

Agreed.

> For example:
>
> - some use cases of `switch` are extremely performance-critical, like
> lexers/interpreters, while at the other end of the spectrum things like
> enum->string conversions for printing are less so (and should probably be
> optimized for size).
> - some uses of switches do very different things in each of their case
> destinations (and use fallthrough, which must be exploited properly by the
> lowering), while others have very similar code in their destinations (the
> entire switch might actually be describing a mapping on data, rather than a
> control flow construct per se; e.g. a table lookup (dense indices) or
> binary search in a static table (sparse indices) might be a great lowering).
> - some switches have multiple case values that map to each destination
> while others have a 1-1 mapping.
>
> Those are some ways that switches can vary off the top of my head. I think
> it would be really valuable if somebody who has easy access to compiling a
> large amount of real-world code with a custom compiler could collect some
> data (my understanding is that Google's internal codebase is a good
> candidate).
>

(Sadly) My world consists of SPEC2K and SPEC2K6, so I don't have much to
offer here. :confused:

SPEC might not be too bad for this particular investigation (or at least
one part of it), since it hits a lot of the main "hot" situations that we
care about (lexers, compression, interpreters, etc.). It will likely
underrepresent the importance of many of the "cold" switches where we
likely want to focus on code size and not polluting the i-cache.

Regardless, I would say: the more data the better!

-- Sean Silva