Hi! I’m experimenting with using embedded Rust to program an esp32c3 (riscv isa), and I’ve been chasing performance & especially consistency for our interrupt service routines, as at least one of them has a hard real-time bound. The esp32c3 has a very simple memory hierarchy: there’s 400KiB of SRAM, which is all addressable within about one CPU cycle, and 4MiB of “external Flash” that’s mapped by way of a cache. I haven’t gotten firm numbers on just how much slower a cache miss is than a one-cycle SRAM access, but it’s a lot. More importantly, it’s a source of variability that we’d like to avoid.
Ok, so that’s what I’m confident about. Everything that follows from this point is more or less an attempt at applied Cunningham’s Law where I present what I’ve done and (hopefully!) y’all tell me exactly how up-the-wrong-tree I’m barking.
The primary tool I’ve been applying has been moving things between ELF sections so we put every symbol inside a segment that gets loaded into SRAM at program start and avoids the cache entirely: this has produced almost zero variance (<1ppm cycles, as measured by the CPU’s built-in perf register) on the parts of the program I’ve been able to apply it to. The next target on my list, though, is the jump tables LLVM is emitting. As an example, here’s one implementing a mapping from a code (1-16) to a matching `interruptN` function (not shown):
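Since the disassembly itself is elided here, a Rust-level sketch of the kind of dispatch in question may help; the handler names and return values below are made up for illustration (on the real target they’d be the SRAM-resident `interruptN` routines), and whether LLVM actually emits a table depends on having enough dense arms:

```rust
// Hypothetical stand-in for the real dispatch: a dense `match` over a
// small integer range is exactly the shape LLVM likes to lower to a
// jump table -- a .rodata array of code addresses plus an indexed
// indirect jump -- instead of a chain of compare-and-branch.
fn interrupt1() -> u32 { 1 }
fn interrupt2() -> u32 { 2 }
fn interrupt3() -> u32 { 3 }
fn interrupt4() -> u32 { 4 }
fn interrupt5() -> u32 { 5 }

fn dispatch(code: u32) -> u32 {
    // Abbreviated to five arms; the real function covers codes 1-16.
    match code {
        1 => interrupt1(),
        2 => interrupt2(),
        3 => interrupt3(),
        4 => interrupt4(),
        5 => interrupt5(),
        _ => 0,
    }
}

fn main() {
    assert_eq!(dispatch(3), 3);
    assert_eq!(dispatch(16), 0);
}
```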
What I’m seeing here is that our SRAM-resident[^1] function (`0x4038...` addresses on the left) is loading its lookup table from Flash (`0x3c0...` address) by way of a `lui`/`addi` pair (the former setting the top 20 bits of a register, the latter filling in the bottom 12). That’s happening because our `.rodata` section is getting placed into the Flash range, which is where LLVM seems to be placing its jump tables as well.
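As a worked example of that `lui`/`addi` pairing (the address below is made up): because `addi`’s 12-bit immediate is sign-extended, the upper 20 bits have to be bumped by one whenever bit 11 of the low part is set, so the pair recomposes the full 32-bit address exactly:

```rust
// Split a 32-bit address into the (hi20, lo12) pair that a RISC-V
// lui/addi sequence would use to materialize it.
fn hi_lo(addr: u32) -> (u32, i32) {
    let lo = (addr & 0xfff) as i32;
    // addi sign-extends its immediate: if bit 11 is set, the low part
    // acts as a negative offset, so hi20 must compensate upward.
    let lo = if lo >= 0x800 { lo - 0x1000 } else { lo };
    let hi = addr.wrapping_sub(lo as u32) >> 12;
    (hi, lo)
}

fn main() {
    let addr: u32 = 0x3c01_0a84; // hypothetical Flash .rodata address
    let (hi, lo) = hi_lo(addr);
    // Recompose exactly as the CPU would: lui, then addi.
    let recomposed = (hi << 12).wrapping_add(lo as u32);
    assert_eq!(recomposed, addr);
}
```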
Since we can’t afford to move the whole `.rodata` section into memory, I was hoping we could find a way to ask LLVM to emit its jump tables in a way that we could move around more easily: concretely, I’d like to express that “all jump tables should be memory-resident.” Otherwise they’re likely a de-optimization when used from RAM,[^2] and they very certainly introduce variability. That doesn’t seem possible at the moment, though: the linker doesn’t have the capability to operate at the symbol level (it only appears to map input sections to output sections), and the ELF lowering appears to be very determined to use the `.rodata` section. If we figure out how to pass the equivalent of `-ffunction-sections` to LLVM through rustc (it may be on by default?), it looks like we could get `.rodata.<function>`, but I’m not sure that helps much: I don’t see an obvious way to exclude a specific name from an [input section pattern][input-wildcards] in a linker script, so I’m not sure how we’d express “handle this function’s sub-symbols differently.”[^3]
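One detail that might make exclusion unnecessary, if per-function `.rodata.<function>` sections do materialize: GNU ld assigns each input section at most once, in script order, so instead of excluding a name from the catch-all pattern you can claim it with an earlier, more specific one. A hypothetical sketch (the memory region names and the `interrupt_dispatch` section name are my assumptions, and whether the jump table actually lands in the per-function section is exactly the thing that’d need verifying):

```ld
SECTIONS
{
  /* Claim this one function's constants for the SRAM data region... */
  .sram_rodata :
  {
    *(.rodata.interrupt_dispatch*)
  } > RWDATA

  /* ...then let the catch-all take the rest. ld never places an input
     section twice, so anything matched above is skipped here. */
  .rodata :
  {
    *(.rodata .rodata.*)
  } > FLASH
}
```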
The other, less-bad idea I’ve had is to try to disable jump table generation entirely by way of the `no-jump-tables` function attribute, following one of the strategies mentioned over in this similar discussion, but that seems impossible due to [differences in the C++ API for option parsing][rustc-llvm-args-issue] (it appears the various `-m` options have to be explicitly enabled for that to work?). What I can do seems to include:
```
$ rustc -C llvm-args='--help-list-hidden' | grep -- 'jump-table'
...
  --min-jump-table-entries=<uint> - Set minimum number of entries to use a jump table.
  --max-jump-table-size=<uint>    - Set maximum size of jump tables.
...
```
Setting the first one to something large seems to do nothing (my guess is that rustc is also setting that and clobbering my setting?), but adding `-C llvm-args=--max-jump-table-size=0` to my `.cargo/config.toml` does indeed seem to turn off jump tables globally. That conclusion doesn’t feel very satisfying to me, though: jump tables are a useful optimization, as long as they’re within the same (or better) “tier” of the memory hierarchy.
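For concreteness, here’s the global switch-off as I’d express it in `.cargo/config.toml` (the target triple is my assumption for the esp32c3; adjust to your actual target):

```toml
# .cargo/config.toml
# Cap jump table size at zero, i.e. never emit one. Note this applies
# globally, which is the unsatisfying part.
[target.riscv32imc-unknown-none-elf]
rustflags = ["-C", "llvm-args=--max-jump-table-size=0"]
```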
My worst idea is to write a linker wrapper that munges the input ELF on its way in, moving the jump tables around to precisely express which placements are ok (flash->flash, memory->memory, or even flash->memory): it’d work, but it’d be a lot of work.
So, specific questions:
- Am I wrong that there’s not enough granularity in the linker script language to use it to accomplish my goal? That’d be really great if so.
- Would a patch that moves the jump tables into their own section (`.rodata.ljti`?) be something the LLVM project would be likely to accept? It’s a long row to hoe with rustc to realize that change in my own project, but it would seem to have the merit of being more or less conceptually in line with all the existing moving parts.
- Am I missing something about how to more precisely target that `no-jump-tables` function attribute, so I can apply it to just the narrow set of functions in my interrupt call path? I’m not sure what a “module” is in LLVM or how it relates to Rust’s notion of either “crates” or “modules” (I would guess it’s closer to the former), but maybe that’s a useful scope?
- Do you happen to be aware of any prior art that performs the kind of ELF pre-linking… relocation (I guess?) that I described? The closest thing I’ve found so far is [flip-link], but it looks like that tool’s theory of operation is to munge the linker scripts, not the binary directly.[^4]
Thanks for your time!
(links broken to get under the new user limit, sorry)
[input-wildcards]: sourceware. org/binutils/docs/ld/Input-Section-Wildcards.html
[rustc-llvm-args-issue]: github. com/rust-lang/rust/issues/26338
[flip-link]: github. com/knurling-rs/flip-link
[^1]: The memory map for this device is such that addresses starting with `0x403[7-f]...` are SRAM-resident code, those starting with `0x3fc[8-f]...` are SRAM-resident data, and `0x3c[0-7]...` addresses are Flash-resident code and data.
[^2]: At a guess, we can probably evaluate a couple dozen not-taken branches per cache miss.
[^3]: We’d also have to enumerate every function whose jump tables we wanted to be in memory, but we’ve got to do something similar to identify the trap handling routine’s transitive closure anyway, so we’ve already traded away most of the generality we’d lose in doing that.
[^4]: I mean, I kind of understand why they didn’t simply write a large-ish subset of a linker.