How to get Greedy RA to not spill results of trivially rematerializable instructions

I have encountered a rather odd situation with Greedy where it will end up spilling a register that was populated with a zero (with a trivially rematerializable load-immediate instruction).
In fact, it spills 3 such values (LICM moves stuff out of a loop, register coalescer replaces copies with load-immediates and then Greedy spills them).

I personally can’t think of a situation where a spill (with a reload later presumably) is better than simply rematerializing the value where it would have otherwise been reloaded. To that end, would it be possible for Greedy to simply duplicate the trivially rematerializable instruction at every reload site? Perhaps this is something it would need to query the target for? Perhaps Greedy would be able to call something like TargetInstrInfo::rematerializeValue(MachineInstr &RematMI, MachineBasicBlock::iterator InsertAt) or something along those lines?
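To make the proposal concrete, here is a minimal sketch of "rematerialize at the reload site" on a toy, straight-line instruction model. Everything in it (`Op`, `Inst`, `rematInsteadOfReload`) is hypothetical and invented for illustration; it is not LLVM's `MachineInstr` or `TargetInstrInfo` API, and it ignores control flow entirely. It only rewrites a stack slot when that slot is spilled exactly once and the spilled register is known to hold a load-immediate at that point.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Toy model (hypothetical, not LLVM's MachineInstr).
enum class Op { LoadImm, Spill, Reload, Other };

struct Inst {
  Op Kind;
  unsigned Reg;      // register defined (LoadImm, Reload, Other) or read (Spill)
  std::int64_t Val;  // immediate for LoadImm, stack slot for Spill/Reload
};

// Replace each reload of a "constant-only" slot with a fresh LoadImm and
// drop the spill that fed it. Slots with several spills or with a spill of
// an unknown value are left untouched.
std::vector<Inst> rematInsteadOfReload(const std::vector<Inst> &Block) {
  std::unordered_map<unsigned, std::int64_t> RegImm;       // reg -> known constant
  std::unordered_map<std::int64_t, std::int64_t> SlotImm;  // slot -> spilled constant
  std::unordered_set<std::int64_t> Unsafe;                 // slots we must not rewrite

  // Pass 1: find out what each slot holds.
  for (const Inst &I : Block) {
    switch (I.Kind) {
    case Op::LoadImm:
      RegImm[I.Reg] = I.Val;
      break;
    case Op::Reload:
    case Op::Other:
      RegImm.erase(I.Reg);  // register no longer holds a known constant
      break;
    case Op::Spill: {
      auto It = RegImm.find(I.Reg);
      if (It == RegImm.end() || SlotImm.count(I.Val))
        Unsafe.insert(I.Val);  // unknown value, or second spill to this slot
      else
        SlotImm[I.Val] = It->second;
      break;
    }
    }
  }

  // Pass 2: rewrite reloads, drop the now-dead spills.
  std::vector<Inst> Out;
  for (const Inst &I : Block) {
    bool TouchesSlot = I.Kind == Op::Spill || I.Kind == Op::Reload;
    bool Remat = TouchesSlot && SlotImm.count(I.Val) && !Unsafe.count(I.Val);
    if (!Remat)
      Out.push_back(I);
    else if (I.Kind == Op::Reload)
      Out.push_back(Inst{Op::LoadImm, I.Reg, SlotImm[I.Val]});
    // A rematerialized slot's spill is simply not emitted.
  }
  return Out;
}
```

In this model, a `li` followed by a spill and a later reload collapses to a second `li` at the reload site, while a slot that receives two spills is deliberately left alone.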

Do you have a reproducer?

That shouldn’t happen.

I do have a reproducer, but it’s not for the faint of heart :)

This is from a large and messy C file (Perlbench’s regexec.c), reduced by bugpoint down to 1050 lines of IR. Perhaps I can paste it on pastebin.

Just for fun, I added some debug dumps for machine instructions that spill registers (i.e. return non-zero from MachineInstr::getFoldedSpillSize()) that are fed by load-immediates and kill that register. Then I bootstrapped LLVM/Clang/compiler-rt with those dumps. Turns out there are 5692 occurrences of that. I might have more luck reducing one of those files.

Finally managed to reduce this to something manageable: https://godbolt.org/z/Hw529k

On line 40 of the output, we have a load-immediate to put zero into R3. Then we spill that value on the next line. And as far as I can tell, we reload it on line 97 before the call to getValueAsBit().

Thanks for the reduced test case, I’ll try to take a look by the end of the week.

Hi Nemanja,

I haven’t looked in the compiler, but just from the output assembly, this is more complicated than plain rematerialization.

Unless I read the assembly wrong, what we spill is not a simple constant but two different values based on some condition.
Basically, we are looking at code that looks like this:

    If <…>
      R3 = some value
    Else
      R3 = cst
    = R3

When we spill R3, we don’t know whether the reload will see the value from the if part or the else part. Furthermore, even if R3 were a constant on both paths, we would need to emit a select-like instruction to pick the right value, which is definitely not simple rematerialization.
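The condition described above can be sketched as a small predicate over the definitions that reach a reload. This is a toy model under stated assumptions: `Def` and `canRematAtReload` are hypothetical names, not LLVM data structures, and the check is deliberately conservative (it rejects two reaching definitions even if both happen to be constants).

```cpp
#include <cstdint>
#include <vector>

// Toy description of one definition reaching a reload (hypothetical).
struct Def {
  bool IsTrivialRemat;  // e.g. a load-immediate with no other operands
  std::int64_t Imm;     // the immediate, meaningful only if IsTrivialRemat
};

// A reload can be replaced by re-executing one instruction only when a
// single, trivially rematerializable definition reaches it. With two
// reaching definitions we would have to choose between them at run time.
bool canRematAtReload(const std::vector<Def> &ReachingDefs) {
  return ReachingDefs.size() == 1 && ReachingDefs.front().IsTrivialRemat;
}
```

With the if/else shape shown earlier, two definitions of R3 reach the reload, so the predicate returns false and a real spill is needed.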

Let me know if I missed something; otherwise I feel there isn’t much we need to do here.

Cheers,
-Quentin

Quentin, thanks so much for looking at this. I should have noticed the other spill to the same stack slot if control doesn’t flow through block 2 (line 32).

I am sorry to have wasted your time. For the original issue, we won’t be able to do anything for the spills, but we can clean up the issue where we materialize the same constant multiple times into the same register just to spill it.

Nemanja

I am sorry to have wasted your time.

No worries.

For the original issue, we won’t be able to do anything for the spills, but we can clean up the issue where we materialize the same constant multiple times into the same register just to spill it.

I am not sure I follow that part. Could you elaborate?

Oh, all I mean is that we can clean up what we saw in the original code:

li r3, 0
std r3, N1(r1) # spill to stack slot 1, kills R3
li r3, 0
std r3, N2(r1) # spill to stack slot 2, kills R3
li r3, 0
std r3, N3(r1) # spill to stack slot 3, kills R3

We can detect this in a peephole, realize that we keep loading the same value into r3, get rid of the kill flags and unnecessary constant materialization and allow the HW to execute the stores in parallel rather than having to execute the entire sequence in order.
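As a rough illustration of that cleanup, here is a minimal sketch on a toy instruction model for a single basic block. The types and the function name (`Opcode`, `Inst`, `dropRedundantLoadImms`) are hypothetical and are not part of LLVM; an actual peephole would work on machine instructions and would have to consider every instruction that can clobber the register.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy model (hypothetical, not LLVM's MachineInstr).
enum class Opcode { LoadImm, Store, Other };

struct Inst {
  Opcode Op;
  unsigned Reg;      // defined by LoadImm/Other, read by Store
  std::int64_t Imm;  // immediate for LoadImm, stack offset for Store
  bool KillsReg;     // Store only: marks the last use of Reg
};

// Drop a LoadImm when the register already holds the same immediate, and
// clear the kill flag on the preceding store so the value stays live.
std::vector<Inst> dropRedundantLoadImms(const std::vector<Inst> &Block) {
  struct Known {
    std::int64_t Imm;
    bool HasPendingKill;
    std::size_t KillIdx;  // index in Out of the store that killed the register
  };
  std::unordered_map<unsigned, Known> Regs;
  std::vector<Inst> Out;

  for (const Inst &I : Block) {
    switch (I.Op) {
    case Opcode::LoadImm: {
      auto It = Regs.find(I.Reg);
      if (It != Regs.end() && It->second.Imm == I.Imm) {
        // Same constant is already in the register: keep it alive instead.
        if (It->second.HasPendingKill) {
          Out[It->second.KillIdx].KillsReg = false;
          It->second.HasPendingKill = false;
        }
        continue;  // skip the redundant load-immediate
      }
      Regs[I.Reg] = Known{I.Imm, false, 0};
      Out.push_back(I);
      break;
    }
    case Opcode::Store: {
      Out.push_back(I);
      auto It = Regs.find(I.Reg);
      if (I.KillsReg && It != Regs.end()) {
        It->second.HasPendingKill = true;
        It->second.KillIdx = Out.size() - 1;
      }
      break;
    }
    case Opcode::Other:
      Regs.erase(I.Reg);  // register is clobbered, forget its constant
      Out.push_back(I);
      break;
    }
  }
  return Out;
}
```

Applied to the six-instruction sequence above, this sketch keeps one `li` and the three stores, with only the last store still carrying the kill flag.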

Agree.