ARMv4T Copy Lowering

Jim/Tim/Renato,

A few days ago (has it been weeks now?) we discussed a codegen problem on
armv4t having to do with lo->lo register copies. I'd like to start that
discussion again, this time with a patch.

A brief summary of the problem for folks who didn't catch the discussion
earlier, and those like me who forget what they ate for breakfast: ;]
The mov instruction on armv4t (specifically, thumb1) is limited in that it
cannot copy from lo register to lo register, but the register allocator assumes
it can do whatever it wants. This means that copies get emitted which don't
work on that architecture, leading to broken code even for trivial cases like:
"void foo(int a, int b) { bar(b, a); }".

There are several different options which we discussed, and various trade-offs
for each (roughly ordered in amount of work to implement):
    1) push {$src} ; pop {$dst}
         The idea here is that since push & pop are relatively unrestricted in
         which registers they can operate on, we can use the stack to do the
         lo -> lo copy. The big drawback here is performance (both code size and
         speed). On the other hand, it looks 'obviously' correct. The attached
         patch implements this.
    2) mov hi, $src ; mov $dst, hi
         Here, we copy through some high register, but the options on which one
         are a bit limited:
            r8 - Callee save (in ARM mode, unused in Thumb mode)
            r9 - IIRC, this one is a platform defined special register?
            r10 - Caller save (in ARM mode, unused in Thumb mode)
            r11 - Caller save (in ARM mode, unused in Thumb mode)
            r12 - Linker Scratch
         Callee save regs are basically a non-starter, and r9 is not available
         on all platforms so that leaves r10-12. I tired using r12, but I think
         this clashes with assumptions made by the register scavenger that it can
         use that register. I haven't tried using r10 & r11.
           At the time when we discussed this, it was suggested that I just let
         the register scavenger take care of it. I tried that unsuccessfully,
         and I don't remember what the hang-up was.
    3) movs $dsdt, $src
         This one clobbers the cpsr. This one works for a lot of cases, but
         fails for phi's that end up between, say a cmp and a branch at the
         bottom of a loop. To use this one, we'd need liveness information at
         the location of the copy in order to ensure that we don't clobber cpsr
         when it's actually going to be used. I went looking for liveness
         information, and found a function that looked like it calculated it for
         a small window around a MachineBasicBlock::iterator... that seems like
         it would work if I told it that the window was the whole basic block. Am
         I missing a better way to get post-RA liveness? I haven't tried
         implementing this.
    4) Do some combination of the others, and tell the register allocator
        that non-cpsr-clobbering lo-lo copies are slower, and that it should
        avoid them.

I've got the patch ready that implements (1). Does it make sense to commit this
(or something close to it) in the near term to fix the bug, and let someone
else worry about performance later? (Possibly me, but it will probably be a
a while before I come back to this. Hopefully this thread serves as a nice
bread-crumb for anyone who is interested)

Cheers,

Jon

pushpop_armv4t_lowercopy.diff (3.48 KB)

Hi Jon,

Correctness before performance any day.

This patch is simple and obvious enough that you can commit after you
add some test showing that it'll select push/pop for v4/v5 on thumb1
and not for other combinations (up to v6 is enough). Also, please add
some short FIXME one-liner saying that we should try high registers
but beware of scheduling issues inside your current comment.

I'm sure it'll boil down as you working on this again, as I don't know
anyone else *that* interested in v4T. :slight_smile:

Thanks!
--renato

I've got the patch ready that implements (1). Does it make sense to commit
this
(or something close to it) in the near term to fix the bug, and let someone
else worry about performance later? (Possibly me, but it will probably be a
a while before I come back to this. Hopefully this thread serves as a nice
bread-crumb for anyone who is interested)

Hi Jon,

Correctness before performance any day.

This patch is simple and obvious enough that you can commit after you
add some test showing that it'll select push/pop for v4/v5 on thumb1
and not for other combinations (up to v6 is enough). Also, please add
some short FIXME one-liner saying that we should try high registers
but beware of scheduling issues inside your current comment.

Sounds good, committed as r216138.

Crap, I forgot to mention Iain in the commit log... thanks for his help on this too!

I'm sure it'll boil down as you working on this again, as I don't know
anyone else *that* interested in v4T. :slight_smile:

Heh :slight_smile: