Hi,
Below you can see the updated benchmark results for the new SLP-vectorizer. As you can see, there is a small number of compile time regressions, a single major runtime regression, and many performance gains. There is a tiny increase in code size: 30k for the whole test-suite. Based on the numbers below I would like to enable the SLP-vectorizer by default for -O3. Please let me know if you have any concerns.
Thanks,
Nadav
I now understand the Olden/BH regression better. BH is slower because of a store-buffer stall. This means that the store buffer fills up and the CPU has to wait for some stores to finish. I can think of two reasons that may cause this problem. First, our vectorized stores are followed by a memcpy that's expanded to a list of scalar reads and writes to the same addresses as the vector store. Maybe the processor can't prune multiple stores to the same address with different sizes (Section 2.2.4 in the optimization guide has some info on this). Another possibility (less likely) is that we increase the critical path by adding a new pshufd instruction before the last vector store, and that affects the store buffer somehow. In any case, there is not much we can do at the IR level to predict this.
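For illustration only, the problematic pattern described above can be sketched in C. The struct and function names below are hypothetical and are not taken from Olden/BH; the point is that four adjacent scalar stores are a natural SLP candidate for one wide store, and a following struct copy (which the compiler may expand like a memcpy into scalar reads and writes) then touches the same addresses with a different access size:

```c
#include <assert.h>

/* Hypothetical example: four adjacent doubles, as in an N-body position
 * update. The SLP vectorizer can merge the four scalar stores in
 * update() into a single wide vector store. */
struct body { double pos[4]; };

void update(struct body *dst, const struct body *src) {
    /* Four adjacent stores to consecutive addresses: an SLP candidate
     * for one 256-bit (or two 128-bit) vector stores. */
    dst->pos[0] = src->pos[0] + 1.0;
    dst->pos[1] = src->pos[1] + 1.0;
    dst->pos[2] = src->pos[2] + 1.0;
    dst->pos[3] = src->pos[3] + 1.0;
}

void copy_back(struct body *dst, const struct body *src) {
    /* A struct assignment right after the vectorized stores; the
     * compiler may lower it to a memcpy-like sequence of scalar loads
     * and stores that overlap the earlier vector store with a
     * different size, which is the suspected store-buffer hazard. */
    *dst = *src;
}
```

This is only a sketch of the shape of the code; whether the store buffer actually stalls depends on the microarchitecture and on how the memcpy is expanded.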
Sorry for not posting sooner. I forgot to send an update with the results.
I also have some benchmark data. It confirms much of what you posted – binary size increase is essentially 0, performance increases across the board. It looks really good to me.
However, there was one crash that I'd like to check whether it still fires. Will update later today (feel free to ping me if you don't hear anything).
That said, why -O3? I think we should just enable this across the board, as it doesn’t seem to cause any size regression under any mode, and the compile time hit is really low.
I agree. I think that it would be a good idea to enable it for -Os and -O2, but I'd like to take one step at a time.
Thanks,
Nadav
These results are really excellent. They’re on Intel, I assume, right? What do the ARM numbers look like? Before enabling by default, we should make sure that the results are comparable there as well.
-Jim
Nadav,
I ran some benchmarks and I’m seeing 3-6% performance loss on a few of them when we use SLP on both O2 and O3 (with O3 having the biggest differences).
Unfortunately, my benchmarking is not scientific, so I can't vouch for those numbers, nor will I have time to investigate them more closely in the short term, but I wouldn't be surprised if this is the result of the extra shuffles we were seeing a few months back with the BB vectorizer. This means that we could maybe trim that off (later) by fixing two or three bugs and (fingers crossed) making those shuffles go away.
I'm trying to set up a task to compare the most important compile options (including all three vectorizers) at all optimization levels, but that's not going to happen any time soon, so don't take my word for it. If I'm the only one with bad numbers, I'm sure we can fix the issues later if you do introduce SLP into O3 now.
Though, I would like to wait a bit more for O2 and Os, because I also can't vouch for its correctness, since we don't have buildbots with SLP enabled, nor at O2, and O2 is more or less the "fast, but still stable" mode that people prefer for compiling production code.
Turn it on for O3, let's see how the bots behave, let's get a few data points and have a bit more information on its state.
cheers,
–renato
Hi Renato,
I am sorry for the delay in responding. Thanks for running the benchmarks. Yi Jiang helped me to run some benchmarks and we got similar results. I noticed a single regression above 5% (5.5% in ClamAV). I am seeing far fewer performance gains on ARMv7s. I suppose that the vector codegen support on x86 is more mature. I will follow your suggestion and enable it by default only for -O3.
Thanks,
Nadav