An alternative way to resolve complex LSR solutions (need perf testing)

Hi All,

I've committed an alternative way to resolve complex (>UINT16
variants) LSR solutions under an option "-lsr-exp-narrow" (which is
temporarily set to true by default).
I'll turn the option to false once I get your feedback.
The method is based on the mathematical expectation of the number of
registers and should generally be closer to the optimal solution.
However, there could be corner cases, so please let me know if there
are gains/regressions on your benchmarks.
The biggest performance changes are on 32-bit x86 (as there are not
too many registers). On my benchmark set there are more gains than
regressions.

Compile-time changes are also important (however, I don't expect many
changes, as complex solutions are not that frequent in hot loops).

Thanks in advance,


I originally missed this email, but we did notice the results in our internal benchmarks. Some results are up, some are down, as you might expect. A good place to start for results would be the LNT results here:

They show Shootout-C++/matrix-c++ down, and internally we see Shootout/matrix down in more configurations. Some of our other internal benchmarks with code patterns similar to the matrix ones are down too. They appear to have similar nested for loops and similar access patterns.
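
For readers unfamiliar with this kind of code, here is an illustrative sketch of a nested for loop with the sort of access pattern I mean. It is not the benchmark source; the Matrix alias and the multiply function are hypothetical names chosen for the example, and it assumes both inputs are well-formed rectangular matrices of compatible sizes. The inner loop walks a row of one matrix and a column of the other, so there are several address computations per iteration for LSR to choose induction variables for.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical matrix type for illustration only.
using Matrix = std::vector<std::vector<int>>;

// Triple loop nest: a[i][k] advances along a row while b[k][j]
// advances down a column, so the two accesses have different strides.
Matrix multiply(const Matrix &a, const Matrix &b) {
  const std::size_t rows = a.size();
  const std::size_t inner = b.size();
  const std::size_t cols = inner ? b[0].size() : 0;

  Matrix c(rows, std::vector<int>(cols, 0));
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      int sum = 0;
      for (std::size_t k = 0; k < inner; ++k)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }
  return c;
}
```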

We don't have compile-time numbers handy, but the LNT results above seem to show them near the bottom. They can sometimes be noisy, but hopefully they are of some help.