Less aggressive on the first allocation of CSR if detecting an early exit

When compiling C code below for AArach64, I saw that shrink-wrapping didn’t happen due to the very early uses of CSRs in the entry block. So CSR spills/reloads are executed even when the early exit block is taken.

int getI(int i);

int foo(int *P, int i) {

if (i>0)

return P[i];

i = getI(i);

return P[i];

}

It’s not that hard to find such cases where RegAllocGreedy aggressively allocates a CSRs when a live range expands across a call-site. That’s because of the conservatively initialized CSRCost, causing RegAllocGreedy to strongly favour allocating a CSR over splitting a region. Since allocation of CSRs requires the cost of spilling CSRs, allocating CSRs is not always beneficial. Like the case above, if a function has an early exit code, we may want to be less aggressive on the first allocation of CSR in the entry block by increasing the CSRCost.

Previously, I proposed https://reviews.llvm.org/D34608 in this matter, but the way I detect the profitable cases and the way I increase the CRSCost was somewhat unclear. Now, I’m thinking to less aggressive on the first allocation of CSR in the entry block in case where the function has an early exit so that encourage more shrink-wrapping and avoid executing CSR spill/recover when the early exit is taken. By sending this out, I just want to get any high level feedback early. Please let me know if anyone has any opinion about this.

Thanks,
Jun

So the heuristic will have nothing to do with the presence of calls? Might this increase spilling in loops? -Hal

When compiling C code below for AArach64, I saw that shrink-wrapping
didn't happen due to the very early uses of CSRs in the entry block.
So CSR spills/reloads are executed even when the early exit block is
taken.

int getI(int i);

int foo(int *P, int i) {

if (i>0)

return P[i];

i = getI(i);

return P[i];

}

It's not that hard to find such cases where RegAllocGreedy
aggressively allocates a CSRs when a live range expands across a
call-site. That's because of the conservatively initialized
CSRCost, causing RegAllocGreedy to strongly favour allocating a CSR
over splitting a region. Since allocation of CSRs requires the cost
of spilling CSRs, allocating CSRs is not always beneficial. Like the
case above, if a function has an early exit code, we may want to be
less aggressive on the first allocation of CSR in the entry block by
increasing the CSRCost.

Previously, I proposed https://reviews.llvm.org/D34608 in this
matter, but the way I detect the profitable cases and the way I
increase the CRSCost was somewhat unclear. Now, I'm thinking to less
aggressive on the first allocation of CSR in the entry block in case
where the function has an early exit so that encourage more
shrink-wrapping and avoid executing CSR spill/recover when the early
exit is taken. By sending this out, I just want to get any high
level feedback early. Please let me know if anyone has any opinion
about this.

So the heuristic will have nothing to do with the presence of calls?
Might this increase spilling in loops?

-Hal

Before allowing the first allocation from CSRs, I will check if the virtual register is really live across a call in other blocks. If the function have a call in entry or exit, we don't need to increase the CSRCost. This heuristic will be applied only for the very first allocation from CSRs only in the entry block when the function has an early exit; if a CSR is already allocated in the function, we will use the current default global CSRCost.

Even after increasing the CSRCost, we should end up allocating the first CSR if the cost of splitting the live-range is still higher than the increased CSRCost. I believe the amount we want to increase the CSRCost must be target-dependent, but it must be conservative enough to avoid too many copies in the related spots.

Thanks,
Jun

When compiling C code below for AArach64, I saw that shrink-wrapping
didn't happen due to the very early uses of CSRs in the entry block.
So CSR spills/reloads are executed even when the early exit block is
taken.

int getI(int i);

int foo(int *P, int i) {

if (i>0)

return P[i];

i = getI(i);

return P[i];

}

It's not that hard to find such cases where RegAllocGreedy
aggressively allocates a CSRs when a live range expands across a
call-site. That's because of the conservatively initialized
CSRCost, causing RegAllocGreedy to strongly favour allocating a CSR
over splitting a region. Since allocation of CSRs requires the cost
of spilling CSRs, allocating CSRs is not always beneficial. Like the
case above, if a function has an early exit code, we may want to be
less aggressive on the first allocation of CSR in the entry block by
increasing the CSRCost.

Previously, I proposed https://reviews.llvm.org/D34608 in this
matter, but the way I detect the profitable cases and the way I
increase the CRSCost was somewhat unclear. Now, I'm thinking to less
aggressive on the first allocation of CSR in the entry block in case
where the function has an early exit so that encourage more
shrink-wrapping and avoid executing CSR spill/recover when the early
exit is taken. By sending this out, I just want to get any high
level feedback early. Please let me know if anyone has any opinion
about this.

So the heuristic will have nothing to do with the presence of calls?
Might this increase spilling in loops?

-Hal

Before allowing the first allocation from CSRs, I will check if the virtual register is really live across a call in other blocks. If the function have a call in entry or exit, we don't need to increase the CSRCost. This heuristic will be applied only for the very first allocation from CSRs only in the entry block when the function has an early exit; if a CSR is already allocated in the function, we will use the current default global CSRCost.

Even after increasing the CSRCost, we should end up allocating the first CSR if the cost of splitting the live-range is still higher than the increased CSRCost. I believe the amount we want to increase the CSRCost must be target-dependent, but it must be conservative enough to avoid too many copies in the related spots.

Thanks for explaining.

I suppose you'll want to make sure that the call(s) in question come after the early exit (i.e., that there aren't calls before the early exit)?

When you say "only in the entry block" you mean that the live range starts in the entry block, right (i.e., that it is a function parameter or a vreg defined by some instruction in the entry block)? Does it matter that it is in the entry block, or do you only need it to come before the early exit and have an execution frequency <= to the entry block's execution frequency?

  -Hal

When compiling C code below for AArach64, I saw that shrink-wrapping
didn't happen due to the very early uses of CSRs in the entry block.
So CSR spills/reloads are executed even when the early exit block is
taken.

int getI(int i);

int foo(int *P, int i) {

if (i>0)

return P[i];

i = getI(i);

return P[i];

}

It's not that hard to find such cases where RegAllocGreedy
aggressively allocates a CSRs when a live range expands across a
call-site. That's because of the conservatively initialized
CSRCost, causing RegAllocGreedy to strongly favour allocating a CSR
over splitting a region. Since allocation of CSRs requires the cost
of spilling CSRs, allocating CSRs is not always beneficial. Like the
case above, if a function has an early exit code, we may want to be
less aggressive on the first allocation of CSR in the entry block by
increasing the CSRCost.

Previously, I proposed https://reviews.llvm.org/D34608 in this
matter, but the way I detect the profitable cases and the way I
increase the CRSCost was somewhat unclear. Now, I'm thinking to less
aggressive on the first allocation of CSR in the entry block in case
where the function has an early exit so that encourage more
shrink-wrapping and avoid executing CSR spill/recover when the early
exit is taken. By sending this out, I just want to get any high
level feedback early. Please let me know if anyone has any opinion
about this.

So the heuristic will have nothing to do with the presence of calls?
Might this increase spilling in loops?

-Hal

Before allowing the first allocation from CSRs, I will check if the virtual register is really live across a call in other blocks. If the function have a call in entry or exit, we don't need to increase the CSRCost. This heuristic will be applied only for the very first allocation from CSRs only in the entry block when the function has an early exit; if a CSR is already allocated in the function, we will use the current default global CSRCost.

Even after increasing the CSRCost, we should end up allocating the first CSR if the cost of splitting the live-range is still higher than the increased CSRCost. I believe the amount we want to increase the CSRCost must be target-dependent, but it must be conservative enough to avoid too many copies in the related spots.

Thanks for explaining.

I suppose you'll want to make sure that the call(s) in question come
after the early exit (i.e., that there aren't calls before the early
exit)?

I will check if a virtual register is live across only calls which will be executed when the early exit is not taken. By increasing CSRcost for such case, we increase chances to avoid executing CSR spill/recover when the early exit is taken.

When you say "only in the entry block" you mean that the live range
starts in the entry block, right (i.e., that it is a function
parameter or a vreg defined by some instruction in the entry block)?

Yes.

Does it matter that it is in the entry block, or do you only need it
to come before the early exit and have an execution frequency <= to
the entry block's execution frequency?

The reason I specifically check the entry block is because for now I see the early exit happen only in entry block; limiting that the early exit condition is checked in the entry block and branch to the exit block directly from the entry block if hitting the condition. If overall approach is reasonable, we can certainly extend it.

One thing I thought about doing a while back and never really wrote a POC for is the following:

  • Make FirstCSRCost a property of the MachineBasicBlock (or create a map of MBB* → FirstCSRCost)

  • Implement a pre-RA pass that will populate the map as follows:

  • Identify all blocks with calls

  • Find the nearest common dominator (NCD) to all those blocks (not strict so that a block with a call might be that NCD)

  • If the NCD is the entry block, CSR allocation is cheap in all blocks

  • Make CSR allocation cheap in blocks that are in the dominator tree rooted at NCD

The idea would be to favour CSR allocation in blocks that might be eligible for the prologue and favour splitting in blocks that we’d prefer not to have a prologue in (or before).

Then a CFG such as this:
A
/
B C

/
D E

/
/
/
/
/
F

  • Assume calls are in B and any_of(C,D,E): CSR allocation is cheap everywhere

  • Assume calls are in C or all_of(D,E): CSR allocation is cheap in all_of(C,D,E); CSR allocation is expensive in all_of(A,B,F)

  • Assume only call is in any_of(B,D,E): CSR allocation is cheap only in that block, expensive everywhere else

I think this construction would give us what we want, but there may be [obvious] counter-examples I haven’t thought of.

One thing I thought about doing a while back and never really wrote a
POC for is the following:
- Make FirstCSRCost a property of the MachineBasicBlock (or create a
map of MBB* -> FirstCSRCost)

- Implement a pre-RA pass that will populate the map as follows:

  - Identify all blocks with calls

  - Find the nearest common dominator (NCD) to all those blocks (not
strict so that a block with a call might be that NCD)

  - If the NCD is the entry block, CSR allocation is cheap in all
blocks

  - Make CSR allocation cheap in blocks that are in the dominator tree
rooted at NCD

The idea would be to favour CSR allocation in blocks that might be
eligible for the prologue and favour splitting in blocks that we'd
prefer not to have a prologue in (or before).

Then a CFG such as this:
    A
   / \
  B C
  > / \
  > D E
  > > /
  > > /
  > >/
  > /
  >/
  F

- Assume calls are in B and ANY_OF(C,D,E): CSR allocation is cheap
everywhere

- Assume calls are in C or ALL_OF(D,E): CSR allocation is cheap in
ALL_OF(C,D,E); CSR allocation is expensive in ALL_OF(A,B,F)

- Assume only call is in ANY_OF(B,D,E): CSR allocation is cheap only
in that block, expensive everywhere else

I think this construction would give us what we want, but there may be
[obvious] counter-examples I haven't thought of.

Thanks Nemanja for sharing your idea. I think this might cover the case I was targeting, and we may not need to be limited only in the entry block.
However, I bit worry if this cause RegAllc to slow in allocating CSRs. What if we have a call-site in the early exit path and no call-site in all other blocks. Then, conservative allocation of CSRs might not be a good choice if the reg pressure is high in the all other blocks. As our first step, would it make sense to limit this only when we detect an early exit? I guess Quentin may have some comment.

thanks,
Jun

Hi,

I think it is kind of artificial to tie the CSRCost with the presence of calls.

I think I’ve already mentioned it in one of the review, but I believe it would be better to differentiate when we want to use a CSR to avoid spilling or to avoid splitting. CSR instead of spilling is good, CSR instead of splitting, not so good :).
By doing this, I would expect we mechanically get the desired behavior that CSRs get used for live-ranges that go through calls (otherwise we would have spilled).

My 2c.

Cheers,
-Quentin

Hi,

I think it is kind of artificial to tie the CSRCost with the presence
of calls.
I think I’ve already mentioned it in one of the review, but I
believe it would be better to differentiate when we want to use a CSR
to avoid spilling or to avoid splitting. CSR instead of spilling is
good, CSR instead of splitting, not so good :).

About this, I can see your previous comment in D27366 (I copied it below) :
   Also, that's possible that the right fix/simple fix is to have one CSRCost for split and one for spill.
   Indeed, IIRC, right now the returned cost for both spilling and splitting is going to be the sum of the frequencies where the split/spill happen and given the spill and copy have different cost, we may want to have different comparison.
   E.g., CSRCostForSpill = 5 (ok to trade against more than 5 executed spill/reload) but CSRCostForSpilt = 20 (ok to trade against more than 20 executed copies)

If this is what you meant here, is the CSRCostForSpilt the actual cost directly comparable with the split cost?
Or, it should be multiplied with the entry frequency to be comparable with the split cost, considering that the CSRCost is the cost of spilling CSR in the entry?

Hi,
I think it is kind of artificial to tie the CSRCost with the presence
of calls.
I think I’ve already mentioned it in one of the review, but I
believe it would be better to differentiate when we want to use a CSR
to avoid spilling or to avoid splitting. CSR instead of spilling is
good, CSR instead of splitting, not so good :).

About this, I can see your previous comment in D27366 (I copied it below) :
Also, that’s possible that the right fix/simple fix is to have one CSRCost for split and one for spill.
Indeed, IIRC, right now the returned cost for both spilling and splitting is going to be the sum of the frequencies where the split/spill happen and given the spill and copy have different cost, we may want to have different comparison.
E.g., CSRCostForSpill = 5 (ok to trade against more than 5 executed spill/reload) but CSRCostForSpilt = 20 (ok to trade against more than 20 executed copies)

If this is what you meant here, is the CSRCostForSpilt the actual cost directly comparable with the split cost?
Or, it should be multiplied with the entry frequency to be comparable with the split cost, considering that the CSRCost is the cost of spilling CSR in the entry?

I believe it should be compared directly with the split cost. I.e., CSRCostXXX against CostOfTheOtherChoiceWithFrequency.

Hi,
I think it is kind of artificial to tie the CSRCost with the
presence
of calls.
I think I’ve already mentioned it in one of the review, but I
believe it would be better to differentiate when we want to use a
CSR
to avoid spilling or to avoid splitting. CSR instead of spilling
is
good, CSR instead of splitting, not so good :).

About this, I can see your previous comment in D27366 (I copied it
below) :
Also, that's possible that the right fix/simple fix is to have one
CSRCost for split and one for spill.
Indeed, IIRC, right now the returned cost for both spilling and
splitting is going to be the sum of the frequencies where the
split/spill happen and given the spill and copy have different cost,
we may want to have different comparison.
E.g., CSRCostForSpill = 5 (ok to trade against more than 5 executed
spill/reload) but CSRCostForSpilt = 20 (ok to trade against more
than 20 executed copies)

If this is what you meant here, is the CSRCostForSpilt the actual
cost directly comparable with the split cost?
Or, it should be multiplied with the entry frequency to be
comparable with the split cost, considering that the CSRCost is the
cost of spilling CSR in the entry?

I believe it should be compared directly with the split cost. I.e.,
CSRCostXXX against CostOfTheOtherChoiceWithFrequency.

As far as I know the spill/split cost which is compared with CSRCost is the sum of the frequencies of spots where spill/split happen, and the CSRCost is the cost of spilling CSR at the entry. If we say, for example, 20 copies from pre-split are okay to trade against 1 csr spill at the entry, then shouldn't we multiply FreqOfEntry with the 20 (number of tradable copies)? Please me know if I miss something here?

Hi,
I think it is kind of artificial to tie the CSRCost with the
presence
of calls.
I think I’ve already mentioned it in one of the review, but I
believe it would be better to differentiate when we want to use a
CSR
to avoid spilling or to avoid splitting. CSR instead of spilling
is
good, CSR instead of splitting, not so good :).

About this, I can see your previous comment in D27366 (I copied it
below) :
Also, that’s possible that the right fix/simple fix is to have one
CSRCost for split and one for spill.
Indeed, IIRC, right now the returned cost for both spilling and
splitting is going to be the sum of the frequencies where the
split/spill happen and given the spill and copy have different cost,
we may want to have different comparison.
E.g., CSRCostForSpill = 5 (ok to trade against more than 5 executed
spill/reload) but CSRCostForSpilt = 20 (ok to trade against more
than 20 executed copies)
If this is what you meant here, is the CSRCostForSpilt the actual
cost directly comparable with the split cost?
Or, it should be multiplied with the entry frequency to be
comparable with the split cost, considering that the CSRCost is the
cost of spilling CSR in the entry?

I believe it should be compared directly with the split cost. I.e.,
CSRCostXXX against CostOfTheOtherChoiceWithFrequency.

As far as I know the spill/split cost which is compared with CSRCost is the sum of the frequencies of spots where spill/split happen, and the CSRCost is the cost of spilling CSR at the entry. If we say, for example, 20 copies from pre-split are okay to trade against 1 csr spill at the entry, then shouldn’t we multiply FreqOfEntry with the 20 (number of tradable copies)?

Yes, you’re right with FreqOfEntry. I assumed this was 1.

Hi,
I think it is kind of artificial to tie the CSRCost with the
presence
of calls.
I think I’ve already mentioned it in one of the review, but I
believe it would be better to differentiate when we want to use a
CSR
to avoid spilling or to avoid splitting. CSR instead of spilling
is
good, CSR instead of splitting, not so good :).
About this, I can see your previous comment in D27366 (I copied it
below) :
Also, that's possible that the right fix/simple fix is to have one
CSRCost for split and one for spill.
Indeed, IIRC, right now the returned cost for both spilling and
splitting is going to be the sum of the frequencies where the
split/spill happen and given the spill and copy have different cost,
we may want to have different comparison.
E.g., CSRCostForSpill = 5 (ok to trade against more than 5 executed
spill/reload) but CSRCostForSpilt = 20 (ok to trade against more
than 20 executed copies)
If this is what you meant here, is the CSRCostForSpilt the actual
cost directly comparable with the split cost?
Or, it should be multiplied with the entry frequency to be
comparable with the split cost, considering that the CSRCost is the
cost of spilling CSR in the entry?

I believe it should be compared directly with the split cost. I.e.,
CSRCostXXX against CostOfTheOtherChoiceWithFrequency.

As far as I know the spill/split cost which is compared with CSRCost
is the sum of the frequencies of spots where spill/split happen, and
the CSRCost is the cost of spilling CSR at the entry. If we say, for
example, 20 copies from pre-split are okay to trade against 1 csr
spill at the entry, then shouldn't we multiply FreqOfEntry with the 20
(number of tradable copies)?

Yes, you’re right with FreqOfEntry. I assumed this was 1.

Considering current implementation in initializeCSRCost() where the CSRCost is scaled by the FixedEntry(1 << 14), if we want to set 20 for CSRCostForSplit (20 copies are tradable with one CSR spill in the entry), then I think the initial raw value for CSRCostForSplit should be 20 * (1 << 14) to allow us to multiply 20 with the actual frequency of the entry. Do you agree with it?

Hi,
I think it is kind of artificial to tie the CSRCost with the
presence
of calls.
I think I’ve already mentioned it in one of the review, but I
believe it would be better to differentiate when we want to use a
CSR
to avoid spilling or to avoid splitting. CSR instead of spilling
is
good, CSR instead of splitting, not so good :).
About this, I can see your previous comment in D27366 (I copied it
below) :
Also, that’s possible that the right fix/simple fix is to have one
CSRCost for split and one for spill.
Indeed, IIRC, right now the returned cost for both spilling and
splitting is going to be the sum of the frequencies where the
split/spill happen and given the spill and copy have different cost,
we may want to have different comparison.
E.g., CSRCostForSpill = 5 (ok to trade against more than 5 executed
spill/reload) but CSRCostForSpilt = 20 (ok to trade against more
than 20 executed copies)
If this is what you meant here, is the CSRCostForSpilt the actual
cost directly comparable with the split cost?
Or, it should be multiplied with the entry frequency to be
comparable with the split cost, considering that the CSRCost is the
cost of spilling CSR in the entry?

I believe it should be compared directly with the split cost. I.e.,
CSRCostXXX against CostOfTheOtherChoiceWithFrequency.
As far as I know the spill/split cost which is compared with CSRCost
is the sum of the frequencies of spots where spill/split happen, and
the CSRCost is the cost of spilling CSR at the entry. If we say, for
example, 20 copies from pre-split are okay to trade against 1 csr
spill at the entry, then shouldn’t we multiply FreqOfEntry with the 20
(number of tradable copies)?
Yes, you’re right with FreqOfEntry. I assumed this was 1.

Considering current implementation in initializeCSRCost() where the CSRCost is scaled by the FixedEntry(1 << 14), if we want to set 20 for CSRCostForSplit (20 copies are tradable with one CSR spill in the entry), then I think the initial raw value for CSRCostForSplit should be 20 * (1 << 14) to allow us to multiply 20 with the actual frequency of the entry. Do you agree with it?

Agreed.