[RFC] prestacked annotation to solve risc-v interrupt stacking mess

(putting rfc in the most risc-v discourse part, not sure where it should go exactly)

The proposal text is currently hosted inside the Xteic spec: https://github.com/jnk0le/riscv-total-embedded/blob/master/riscv-total-embedded.adoc#151-prestacked-annotation

full text in case link dies (adoc copypaste)

==== prestacked annotation

Currently there is no universal solution to indicate which registers in interrupt handlers
can be freely used without stacking them.

  • \\__attribute__\((interrupt)) makes all registers callee saved and uses mret to return.
  • \\__attribute__\((interrupt("SiFive-CLIC-preemptible"))) extends regular interrupt by CLIC preemption
  • \\__attribute__\((interrupt("WCH-Interrupt-fast"))) requires a custom-built toolchain,
    no floating point regs (even on cores with the F extension), still uses mret
  • Or just a plain C function that requires prestacking of all caller saved registers and reuses the standard
    return mechanism to exit the interrupt context

Even worse, there are already hardware stackers designed for ilp32e and ilp32. When a new and better ABI is introduced, it will be impossible to use it with the pre-existing HW stackers. The same applies to creating HW stackers that stack fewer registers to optimize interrupt latency.

Therefore we need a universal way to annotate which registers are available for use in a given function as de facto caller saved ones (aka create a custom calling convention):

  • prestacked("") attribute
  • no whitespace in the string parameter
  • a register range covers all registers between and including those specified (x4-x6 is equivalent to x4,x5,x6)
  • registers/ranges are separated by commas
  • callee saved registers have to be properly turned into temporaries when included in the list
  • CSRs taking part in calling conventions are also subject to this mechanism
  • should use raw names instead of ABI mnemonics, to make it ABI agnostic (more portable)
  • registers must be sorted (integer, floating point, vector, custom, then by lowest numbered)
  • CSRs must be put after the architectural regfiles, those don’t have to be sorted
  • must not collide with \\__attribute__\((interrupt)), so as to support “legacy” handler return mechanisms
  • must not imply \\__attribute__\((interrupt)) either
  • custom CSRs would also have to be covered somehow (hw loops etc.)
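
To make the list rules above a bit more concrete, here is a minimal sketch in C of how a tool might expand the integer part of such a string into a register bitmask. The names `prestacked_xmask` and `parse_xreg` are hypothetical and not part of gcc, llvm or the proposal text. The sketch only understands raw `x0`-`x31` names; floating point, vector and CSR entries are deliberately out of scope and are rejected. Accepting an empty string and rejecting overlapping entries are my assumptions, the proposal only says the list must be sorted.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse one raw integer register name ("x0".."x31")
 * at *s and advance *s past it. Returns 0 on success, -1 otherwise. */
static int parse_xreg(const char **s, int *out)
{
    const char *p = *s;
    char *end;
    long n;

    if (p[0] != 'x' || !isdigit((unsigned char)p[1]))
        return -1;
    n = strtol(p + 1, &end, 10);
    if (n > 31)
        return -1;
    *out = (int)n;
    *s = end;
    return 0;
}

/* Hypothetical helper: expand a prestacked-style list such as
 * "x5-x7,x10-x17" into a mask where bit n is set if xn is listed.
 * Returns 0 on success, -1 if the list breaks one of the rules
 * (whitespace, unknown names, unsorted or overlapping entries). */
int prestacked_xmask(const char *list, uint32_t *mask)
{
    int prev = -1; /* highest register number accepted so far */

    assert(list != NULL && mask != NULL);
    *mask = 0;
    if (*list == '\0')
        return 0; /* assumption: an empty list means "nothing prestacked" */

    for (;;) {
        int lo, hi;

        if (parse_xreg(&list, &lo) != 0)
            return -1;
        hi = lo;
        if (*list == '-') { /* range: both ends are included */
            list++;
            if (parse_xreg(&list, &hi) != 0 || hi < lo)
                return -1;
        }
        if (lo <= prev)
            return -1; /* not sorted by lowest numbered register */
        for (int r = lo; r <= hi; r++)
            *mask |= UINT32_C(1) << r;
        prev = hi;

        if (*list == '\0')
            return 0;
        if (*list != ',')
            return -1; /* whitespace or any other separator is rejected */
        list++;
    }
}
```

A real implementation would also need to handle the f/v register files and the trailing CSR names, which this sketch leaves out on purpose.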

psABI caller saved:

\\__attribute__\((prestacked("x5-x7,x10-x17,x28-x31")))

psABI with floating point, caller saved:

\\__attribute__\((prestacked("x5-x7,x10-x17,x28-x31,f0-f7,f10-f17,f28-f31,fcsr")))

Simplified ranges (e.g. shadow register file):

\\__attribute__\((prestacked("x8-x15")))

TEIC irq, range0 + shadow regs of half integer regfile (where bit 2 of operand is set, covers range1+2)
and F + P extensions:

\\__attribute__\((prestacked("x4-x7,x10,x11,x12-x15,x20-x23,x28-x31,fcsr,vxsat")))

ch32v003 irq (ilp32e + PFIC HW stacker, assuming ra doesn’t have some undocumented use):

\\__attribute__\((interrupt, prestacked("x1,x5-x7,x10-x15")))

NOTE: unannotated ra is assumed to hold a valid return address, otherwise a special return mechanism must be
used (e.g. return by mret in \\__attribute__\((interrupt)))
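
As a rough illustration of how source code could adopt the annotation before every compiler knows it, here is a sketch in C. `PRESTACKED`, `timer_irq_handler` and `timer_ticks_read` are made-up names for this example, and `prestacked` itself is only the attribute proposed here, so on current compilers the macro is expected to expand to nothing. The fallback is only reasonable because the list used is the "psABI caller saved" one from above, which (on a target without F) should behave the same as a plain C function; for any other list a silent fallback would be wrong.

```c
#include <assert.h>

/* __has_attribute is a GCC/Clang extension; treat it as "unsupported"
 * on compilers that do not provide it. */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* "prestacked" is the attribute proposed in this RFC. Assumption: a
 * compiler implementing it would report it through __has_attribute. */
#if __has_attribute(prestacked)
#define PRESTACKED(list) __attribute__((prestacked(list)))
#else
#define PRESTACKED(list) /* not supported: plain C function */
#endif

static volatile unsigned timer_ticks;

/* Hypothetical handler: the HW stacker (or an asm stub) is assumed to
 * have saved x5-x7, x10-x17 and x28-x31 before this is entered, and ra
 * is assumed to hold a valid return address. */
PRESTACKED("x5-x7,x10-x17,x28-x31")
void timer_irq_handler(void)
{
    timer_ticks++;
}

/* Small accessor so the counter can be checked from ordinary code. */
unsigned timer_ticks_read(void)
{
    assert(timer_ticks != 0u || 1); /* no precondition, kept trivially true */
    return timer_ticks;
}
```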

===== optimization for noreturn functions

gcc/llvm compilers can purge the epilogue (even down the call tree) by automatic
detection of infinite loop or by using \\__attribute__\((noreturn)) or __builtin_unreachable().

This is not the case for prologues though, leading to wasted stack and code space in the most typical
embedded scenario of main or thread functions with an infinite loop.

This missing optimization is intentional <<noreturnprologue>> to allow backtracing
(abort() etc.) and throwing exceptions (of course, even under -fno-exceptions and in exception-free code).

By abusing the “prestacked annotation” we can get rid of this prologue
by “prestacking” all of the available registers. +
e.g. \\__attribute__\((noreturn, prestacked("x1,x4-x31,f0-f31,fcsr")))

NOTE: addition of a noreturn_nobacktrace_noexcept attribute is very unlikely, optimizing
the regular noreturn attribute is even less likely.

NOTE: \\__attribute__\((naked)) won’t work, as it will remove the stack allocation
and consequently underflow the stack.
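
A sketch of what the thread-function case could look like in C, under the same caveat that `prestacked` is only proposed and not assumed to exist in any released compiler. `NORETURN_NOPROLOGUE`, `worker_thread`, `poll_once` and `poll_count` are hypothetical names. The register list is the one from the example above and assumes a target with the F extension; without the attribute the macro falls back to plain noreturn, which is still correct but keeps the prologue.

```c
#include <assert.h>

#ifndef __has_attribute
#define __has_attribute(x) 0 /* compiler without __has_attribute */
#endif

/* If the proposed attribute is available, declare every register as
 * prestacked so no prologue is needed; otherwise use plain noreturn. */
#if __has_attribute(prestacked)
#define NORETURN_NOPROLOGUE \
    __attribute__((noreturn, prestacked("x1,x4-x31,f0-f31,fcsr")))
#else
#define NORETURN_NOPROLOGUE __attribute__((noreturn))
#endif

static unsigned iterations;

/* One unit of work for the thread; kept separate so it can be called
 * and checked without entering the infinite loop. */
void poll_once(void)
{
    iterations++;
    assert(iterations != 0u); /* counter is not expected to wrap here */
}

unsigned poll_count(void)
{
    return iterations;
}

/* Typical embedded thread body: never returns, so nothing saved in a
 * prologue would ever be restored. */
NORETURN_NOPROLOGUE void worker_thread(void)
{
    for (;;)
        poll_once();
}
```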

===== functions with partially custom calling conventions

It can additionally be abused for:

  • defining IPRA clobbers of assembly functions in their C function declarations
    (see <<applying IPRA to assembly functions>>)
  • certain (premature) optimizations (manually solving 2way IPRA recursion etc.)
  • dynamically linked functions with a subset of clobbers.
    e.g. functions like memcpy(), strcmp() etc. don’t need to clobber all caller saved registers,
    so only the common clobbers for straightforward, unrolled (?) and vectorized implementations need to be
    applied. This requires standardization of canonical clobbers for each offending function (quite unrealistic).

[[[noreturnprologue, 32]]] 56165 – Missed optimization for 'noreturn' functions

Now, let me introduce you to the current mess related to __attribute__((interrupt("WCH-Interrupt-fast"))) and how ridiculous it has become so far.

Because it’s a custom (and unsupported) attribute, there are attempts with __attribute__((naked)) (gcc allows it), which end up like here:

There is a workaround already implemented in various places (and rediscovered in various threads) which forces normal ABI clobbered IRQ at minimal overhead:

void USBHS_IRQHandler (void) __attribute__((naked));
void USBHS_IRQHandler (void)
{
  __asm volatile ("call USBHS_IRQHandler_impl; mret");
}

__attribute__ ((used)) void USBHS_IRQHandler_impl (void)
{
  tud_int_handler(0);
}

But wait a sec. ch32v307 has an FPU. What does it mean?

Yep, the workaround is still broken. Popular gcc builds (with the WCH attribute) are broken as well.

As a quick recap:

  • __attribute__((interrupt)) still works but is inefficient
  • the workaround will break when FP is involved
  • There are gcc builds with a broken __attribute__((interrupt("WCH-Interrupt-fast"))) floating around, and considering that removal of FP regs (from prestacking) is a follow-up patch, it is likely that some of the proprietary builds from MRS are also broken.
  • Is llvm going to add this WCH attribute (and any follow ups) by the way?

When another vendor implements its own HW stacker, they will go the standard corporate NIH route and the cycle repeats.

The prestacked mechanism can be ported to other archs, but those have a stabilised IRQ situation, so only the auxiliary purposes apply there.

It would be best to somehow sync with gcc so it’s more portable across at least libre compilers.

Sorry this proposal hasn’t had a response for a while. @kito-cheng and I wondered if you’d consider making this proposal in the riscv-c-api-doc repo as that would better allow LLVM + GNU devs to review and hopefully agree upon it.

I’ll make the PR to the C API repo.

PR submitted for review. Made it a bit more synthetic there.
