Handling register allocation on Propeller 2

This is a complex situation and I’m still new to LLVM’s codebase, so I’m asking the people on this list for assistance.

The Parallax Propeller 2 (henceforth P2) has 512 32-bit registers in total, 496 of which are allocatable.
There are also two special registers, PTRA and PTRB, that have special semantics when used with memory reads/writes to permit incrementing/decrementing them in place and adding an index value to them. PTRA will likely be the stack register, but PTRB will likely be free for allocation.

First issue: Allocating all of them is a bad idea. Space needs to be left for interrupt handlers, core-local global data, etc. Ideally the compiler would only use, say, 384 of them or fewer. Even more ideally, the number a particular function uses would be configurable, to permit situations where the developer needs more of the regfile for themselves, but I have no idea how to approach that.

Second issue: When dealing with larger objects in the regfile, it is strongly advisable to keep them contiguous. It’s possible to bulk-save/bulk-load any group of contiguous registers in two instructions, at a rate of one register saved per cycle. What would be the best way to utilize this? It also affects, say, loading small arrays and structs into the regfile.

Third issue: The regfile can be indexed indirectly at low cost, and instructions exist to load individual aligned nibbles, bytes, and words from one reg into another (even indirectly, so array access works). Individual memory reads and writes are slow (9-26 cycles and 3-20 cycles respectively), so ideally this would be taken advantage of somehow. It would permit rapidly loading a small array or similar into the regfile and indexing it from there. If the array is small and used multiple times, that would almost always be faster, as the bulk read would take roughly 9-25 + read_amount cycles.
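The trade-off described above can be put into numbers with a rough cost model. This is only a sketch: the memory latencies are the figures quoted in this post, while `regAccessCycles` (the cost of one indexed register read) and all function names are assumptions made for illustration.

```cpp
#include <cassert>

// Best- and worst-case cycle counts for one access strategy.
struct CycleRange {
    unsigned best;
    unsigned worst;
};

// Cost of `accesses` individual memory reads, using the 9-26 cycle figure
// quoted above.
CycleRange directReadCost(unsigned accesses) {
    return {accesses * 9u, accesses * 26u};
}

// Cost of one bulk read of `words` registers (9-25 cycles plus one cycle per
// word, per the estimate above), followed by `accesses` indexed register
// reads. `regAccessCycles` is an assumed per-access cost, not a figure from
// this thread.
CycleRange bulkLoadCost(unsigned words, unsigned accesses, unsigned regAccessCycles) {
    assert(words > 0 && "a bulk read must transfer at least one word");
    unsigned indexed = accesses * regAccessCycles;
    return {9u + words + indexed, 25u + words + indexed};
}
```

For a 16-word array that is read 8 times, and assuming 2 cycles per indexed register access, the direct strategy costs 72-208 cycles while the bulk strategy costs 41-57 cycles.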

As far as I can tell, this isn’t easy to take advantage of quickly: the regfile can be treated as a bank of fast core-local memory (i.e. a zero page), and LLVM doesn’t seem immediately happy with this idea.

Fourth issue: The P2 has two flag regs, C and Z. All instructions that write them have the ability to control, individually, if the flag is written. Alongside this, Boolean operations and moves with the flags, and between the flags and any bit of a register, are all single instruction (and cheap-as-a-move) operations. What’d be a good way to take advantage of this?

The P2 as of now has no standard C calling convention (nor any calling convention suitable for that), so I’m also stuck trying to define a calling convention for this architecture. Any help with that would be appreciated as well, because I’m not familiar with the requirements or the general advice.

Sorry if this is a bit much to ask; any help or advice is appreciated.
—Braden N.

Just a few very quick pointers which may or may not be of help.

> First issue: Allocating all of them is a bad idea. Space needs to be left for interrupt handlers, core-local global data, etc. ideally the compiler would only use, say, 384 of them or less. Even more ideally, the amount a particular function uses would be configurable to permit situations where the developer needs more of the regfile to themself, but I have no idea how to approach that.

A purely static way of doing this is to simply define your register classes accordingly (say 384 in an allocatable class, the rest not).

A dynamic way is to use MyTargetRegisterInfo::getReservedRegs. For a straightforward example, see the RISCV backend which provides a user option to reserve registers. For a more complicated scheme, see the AMDGPU backend which trades occupancy vs registers.
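As a rough illustration of the dynamic approach, the following standalone sketch models the kind of bit set a target's getReservedRegs override would return. It deliberately uses std::bitset instead of LLVM's own types so that it compiles on its own. The names `computeReservedRegs`, `kNumRegs`, and `kNumAllocatable` are hypothetical, and the assumption that the compiler's budget covers the lowest-numbered registers is mine, not something stated in this thread.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Only the 512/496 figures come from the thread; the names are illustrative.
constexpr std::size_t kNumRegs = 512;         // total registers in the regfile
constexpr std::size_t kNumAllocatable = 496;  // registers available for allocation

// Returns a bit set in which a set bit means "the register allocator must not
// use this register". `budget` is the number of low-numbered registers the
// compiler is allowed to use; everything from `budget` upwards is reserved.
std::bitset<kNumRegs> computeReservedRegs(std::size_t budget) {
    assert(budget <= kNumAllocatable && "budget exceeds the allocatable registers");
    std::bitset<kNumRegs> reserved;
    for (std::size_t r = budget; r < kNumRegs; ++r)
        reserved.set(r);
    return reserved;
}
```

With a budget of 384, this reserves registers 384 through 511 (128 in total). In a real backend the budget could come from a command-line option or a per-function attribute, and the result would be built in the bit vector type that getReservedRegs returns.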

> Second issue: When dealing with larger objects in the regfile, it is strongly advisable to keep them continuous. It’s possible to bulk-save/bulk-load any group of continuous registers in two instructions, at a rate of one saved per cycle. What’s be the best way to utilize this? As this also impacts, say, loading small arrays and structs into the regfile.

The question is a bit vague or too general; you might ask more detailed questions in a separate thread. That said, for mechanical issues such as simply representing the ISA's “load multiple” instructions, see the ARM or SystemZ backends for examples (the former also performs some memcpys with ld/st-multiple).

If this is about the more general question of how to “best” assign aggregates/objects to the register file, that can have many dimensions. One could analyze all the objects as a whole and choose some for inclusion and others not, according to some optimization criteria. There is a large body of research on this problem (and it isn’t LLVM specific). Concretely, you might take a look at the AMDGPUPromoteAlloca pass as well as the StackColoring pass.
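To make the "keep them contiguous" requirement concrete, here is a minimal first-fit sketch that hands out a run of adjacent registers for an aggregate, so that the whole run can later be saved or loaded in bulk. It is a toy model and not how LLVM's register allocators work. The name `allocateContiguous` and the bitmap representation are assumptions for illustration, and the code needs C++17 for std::optional.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <optional>

constexpr std::size_t kRegFileSize = 512;  // total P2 registers, per the thread

// Finds the lowest-numbered run of `count` free registers, marks the run as
// used, and returns the index of its first register. Returns std::nullopt if
// no run of that length is free.
std::optional<std::size_t> allocateContiguous(std::bitset<kRegFileSize> &used,
                                              std::size_t count) {
    assert(count > 0 && count <= kRegFileSize);
    std::size_t runStart = 0;  // first register of the current free run
    std::size_t runLen = 0;    // length of the current free run
    for (std::size_t r = 0; r < kRegFileSize; ++r) {
        if (used.test(r)) {
            // An occupied register ends the run; restart just after it.
            runLen = 0;
            runStart = r + 1;
            continue;
        }
        if (++runLen == count) {
            for (std::size_t i = runStart; i <= r; ++i)
                used.set(i);
            return runStart;
        }
    }
    return std::nullopt;
}
```

First-fit is the simplest possible policy and can fragment the register file. Which objects deserve a run at all is the selection problem described above.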

> Fourth issue: The P2 has two flag regs, C and Z. All instructions that write them have the ability to control, individually, if the flag is written. Alongside this, Boolean operations and moves with the flags, and between the flags and any bit of a register, are all single instruction (and cheap-as-a-move) operations. What’d be a good way to take advantage of this?

Without more information about the ISA, it is hard to say much. The feature you describe is similar to the “recording” PowerPC instructions where appending (or not) a “.” to such instructions records (or not) certain status bits. Generally, setting flags like these can serialize instructions during scheduling for processors where that is important, so it is often best not to set them unless needed (e.g., for branching). Whether that matters in your case is unknown.