Optimising loops for simple consumers

Hi,

I’m currently working on wasm in V8 and looking at cases where the baseline compiler ‘liftoff’, which decodes and generates code simultaneously, really struggles to generate fast code. For tight loops, spilling and restoring to the stack can be a real bottleneck.

I’ve found that annotating loops with param and using loop local variables, effectively inserting phis and using LCSSA form, can result in a dramatic uplift in execution speed.

So, my question is whether it is possible to generate loops in this manner? Looking at the addPreEmitPass sequence there seems to be a lot going on, is there a phase ordering problem there which would prevent this or could an existing pass undo any work of using loop variables…?

And this isn’t a thinly veiled request for someone else to do some work, I’m very happy to do it if it’s possible.

cheers,
sam

I don’t know what would have to change in the code to make this work, but there are two broader problems to think about as well.

First, loop parameters would need to be gated behind the multivalue target feature, but there are a bunch of bugs (e.g. clang crash when compiling to wasm32 with multi-value · Issue #58297 · llvm/llvm-project · GitHub) when using that feature right now, so those would have to be fixed too.

Second, and not strictly related to LLVM, the Binaryen optimizer currently has no support for block or loop parameters at all. For this feature to be used in toolchains like Emscripten, we would have to add support to Binaryen.

How much performance benefit id on the table here? That might motivate more work in this area.

Sorry for the delay, I’ve been rather ill.

The performance benefit will be wholly dependent on how many values are being used and the complexity of the loop. For the example that caught my eye, a simple vector add reduce loop, the stack operations at the beginning and end of the loop body are half the instructions. Added to that, the stores round the backedge appear to stall the loads at the start of the loop too. So, manually changing the loop increases performance by 2x.

Are the current multi-values bugs triaged somewhere? How much work do you think there is to get it working?