It is my understanding that the jump-threading implementation in LLVM is not presently able to effectively optimize code containing a state machine implemented as a loop over a switch. This is the case, for example, with the Coremark benchmark function core_state_transition(). Bug 42313 was filed to address this in 2019:
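For readers unfamiliar with the shape, a minimal illustration of the loop + switch state-machine pattern in C follows (the states and the count_digits() function are hypothetical; this is not Coremark’s actual core_state_transition() code). When a case ends by assigning a constant next state, jump threading would ideally branch from that case straight to the successor’s case block instead of going back through the switch:

#include <ctype.h>

enum state { SCAN, DIGIT, ALPHA, DONE };

/* Count digit characters in a string via an explicit state machine. */
int count_digits(const char *p)
{
    enum state s = SCAN;
    int digits = 0;

    while (s != DONE) {                  /* dispatch loop */
        switch (s) {                     /* switch over the current state */
        case SCAN:
            if (*p == '\0')
                s = DONE;
            else if (isdigit((unsigned char)*p))
                s = DIGIT;
            else
                s = ALPHA;
            break;
        case DIGIT:
            digits++;
            p++;
            s = SCAN;   /* successor is a compile-time constant: jump
                           threading could branch directly to the SCAN
                           block instead of re-evaluating the switch */
            break;
        case ALPHA:
            p++;
            s = SCAN;
            break;
        case DONE:
            break;
        }
    }
    return digits;
}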
We have a jump threading pass downstream for this that we would love to upstream. I believe Evgeny was working on exactly this, i.e. preparing it for upstreaming.
And related, while we are at it (i.e. the Coremark specials), we have this sitting in upstream review: https://reviews.llvm.org/D42365
That should help a bit too. It needs a little bit of work, but I thought Dave wouldn’t mind if someone commandeered and finished it.
No active work on my side, but I have given the topic of threaded interpreters (which is what I think you’re wanting to produce) a good amount of thought.
I’m really not sure that switch is the right canonical form. The main reason is that a loop over a large switch is very likely to encourage code motion that is generally profitable but harmful in this particular context.
I had been thinking along the lines of representing the interpreter as a family of mutually recursive functions with a calling convention optimized for this case, using a musttail call through a lookup table for the dispatch.
I’ve played with the notion of extending clang with a custom attribute for guaranteed tail calls. I think this is pretty much the only extension needed to be able to natively write out a threaded interpreter as a set of mutually recursive functions.
This is all thought experiment from my side; I haven’t had time to sit down and actually prototype any of this.
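To make that concrete, here is a minimal sketch of the scheme Philip describes, written in C and assuming a guaranteed-tail-call statement attribute of the kind he mentions (spelled here as Clang’s __attribute__((musttail))). The opcodes, handler names, and vm struct are all hypothetical:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-opcode bytecode; one handler per opcode. */
enum { OP_PUSH, OP_ADD, OP_HALT };

typedef struct vm { int64_t stack[64]; size_t sp; } vm_t;
typedef int64_t (*handler_t)(vm_t *vm, const uint8_t *pc);

static int64_t op_push(vm_t *vm, const uint8_t *pc);
static int64_t op_add(vm_t *vm, const uint8_t *pc);
static int64_t op_halt(vm_t *vm, const uint8_t *pc);

/* Lookup table used for dispatch. */
static const handler_t table[] = { op_push, op_add, op_halt };

/* Every handler ends with a guaranteed tail call through the table,
 * so dispatch is a chain of jumps with no stack growth. */
#define DISPATCH(vm, pc) \
    __attribute__((musttail)) return table[*(pc)]((vm), (pc))

static int64_t op_push(vm_t *vm, const uint8_t *pc)
{
    vm->stack[vm->sp++] = (int64_t)pc[1];  /* immediate operand */
    pc += 2;
    DISPATCH(vm, pc);
}

static int64_t op_add(vm_t *vm, const uint8_t *pc)
{
    vm->sp--;
    vm->stack[vm->sp - 1] += vm->stack[vm->sp];
    pc += 1;
    DISPATCH(vm, pc);
}

static int64_t op_halt(vm_t *vm, const uint8_t *pc)
{
    (void)pc;
    return vm->stack[vm->sp - 1];
}

/* e.g. run((const uint8_t[]){OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT})
 * evaluates 2 + 3 and returns 5. */
int64_t run(const uint8_t *code)
{
    vm_t vm = { .sp = 0 };
    return table[code[0]](&vm, code);
}

Because every handler shares one signature and must tail-call the next one through the table, the dispatch behaves like a threaded-code jump rather than a call, which is exactly the shape the loop-over-switch form makes hard for the optimizer to produce.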
This is great to hear, thanks for the update! Can you provide a timeline for when you expect to have the jump-threading update in review?
The loop-flattening pass also looks very useful. I’m sure we could help with the review – I don’t know about taking ownership of it, at least right now.
We (at Huawei) also have a pass for this. We originally implemented it back in 2018 and meant to upstream it, but there were some issues with the implementation that required changes in the code. We started revising it a few weeks ago.
Now that there are multiple options, I thought maybe we could discuss our approaches and see if there is a preference in the community for one approach over the other. What do you think?
But yeah, if you’ve got one ready too, it’s perhaps best to upload it if that’s what you want to do. I guess that makes it easier to look at both and see which one we’d prefer?
I believe the Wasm3 project (https://github.com/wasm3/wasm3), a WebAssembly interpreter, uses the dispatch technique described by Philip.
I don’t know exactly how it guarantees that the indirect calls are converted to tail calls (maybe it doesn’t), but the performance is quite impressive.
Interesting, I hadn’t seen this. Reading through the docs, it looks like M3 (the wasm3 interpreter) isn’t able to guarantee tail-call dispatch, which isn’t surprising. There’s a whole section on managing stack space, with some special tricks around loops. I will note there’s at least one key misstatement in the description: branches do not fundamentally need stack space; calls do, as correctly noted. Still cool to see someone playing with this in modern times; most of the usage I’m familiar with is in older papers (e.g. the threaded-code context referenced at the bottom of the M3 description).
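To illustrate the branches-vs-calls point, here is the same hypothetical bytecode from the earlier sketch dispatched in the classic GCC/Clang computed-goto style of threaded code: every opcode transition is a branch (goto *target), so no stack space is consumed per executed instruction, in contrast to call-based dispatch:

#include <stddef.h>
#include <stdint.h>

/* Opcodes: 0 = push immediate, 1 = add, 2 = halt. */
int64_t run_threaded(const uint8_t *pc)
{
    /* Label addresses (a GNU C extension), indexed by opcode. */
    static void *labels[] = { &&do_push, &&do_add, &&do_halt };
    int64_t stack[64];
    size_t sp = 0;

    goto *labels[*pc];

do_push:
    stack[sp++] = (int64_t)pc[1];
    pc += 2;
    goto *labels[*pc];        /* a branch, not a call: no stack growth */

do_add:
    sp--;
    stack[sp - 1] += stack[sp];
    pc += 1;
    goto *labels[*pc];

do_halt:
    return stack[sp - 1];
}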
That’s great. Our patch won’t be ready for at least another 2-3 weeks. In the next few days we will spend some time reviewing your patch and will let you know if we have any comments.